AB719417 


SECOND SEMI-ANNUAL TECHNICAL REPORT
(13 July 1970 - 12 January 1971)
FOR THE PROJECT

COMPILER DESIGN FOR THE ILLIAC IV


Massachusetts

COMPUTER ASSOCIATES

division of

APPLIED DATA RESEARCH, INC.


Reproduced by

NATIONAL TECHNICAL
INFORMATION SERVICE

Springfield, Va. 22151


DISCLAIMER  NOTICE 


THIS  DOCUMENT  IS  THE  BEST 
QUALITY  AVAILABLE. 

COPY  FURNISHED  CONTAINED 
A  SIGNIFICANT  NUMBER  OF 
PAGES  WHICH  DO  NOT 
REPRODUCE  LEGIBLY. 


MASSACHUSETTS COMPUTER ASSOCIATES

A division of APPLIED DATA RESEARCH, INC.

LAKESIDE OFFICE PARK, WAKEFIELD, MASSACHUSETTS 01880 • 617/245-9540


SECOND SEMI-ANNUAL TECHNICAL REPORT
(13 July 1970 - 12 January 1971)
FOR THE PROJECT

COMPILER DESIGN FOR THE ILLIAC IV


Principal  Investigator  and  Project  Leader: 

Robert E. Millstein    Phone (617) 245-9540

ARPA  Order  Number  ARPA  1554 
Program Code Number 0D30


Contractor:  Applied  Data  Research,  Inc. 

Contract  No.:  DAHC04  70  C  0023 


Effective  Date:  13  January  1970 

Expiration  Date:  12  October  1971 

Amount:  $303,012.50 

Sponsored  by 

Advanced  Research  Projects  Agency 
ARPA  Order  No.  1554 


ABSTRACT


The ILLIAC FORTRAN Compiler may be characterized as a series
of transformations on the source input stream. FORTRAN code is
transformed into a representation which exploits ILLIAC parallelism. This
transformation is accomplished by detecting individual statements within
DO loops which may be executed in parallel for values of the DO indices,
and determining an ordering which preserves data dependencies. While
the result of this effort affords ILLIAC parallelism, it is insensitive to
two major characteristics of ILLIAC hardware: an enormous disk latency,
and an ability to overlap execution of sequential and parallel components
of the hardware. In order to fully exploit the capabilities of the ILLIAC,
two more transformations are effected. First, code is restructured to
minimize the effect of disk latency. Second, operations are allocated to
maximize CU-PE overlap. At this stage it is appropriate to generate
ILLIAC code.

The basic architecture of the ILLIAC IV FORTRAN Compiler may be
characterized  as  a  series  of  transformations  (each  representing  a 
compilation  phase)  on  the  FORTRAN  source  input  string.  The  first 
phase  parses  the  input  stream  and  generates  a  flow  graph  in  which 
each  statement  is  represented  as  a  node. 

The  parsed  string  and  flow  graph  depict  an  execution  process 
which  has  been  coded  sequentially.  It  may  contain  operations  which 
exhibit ILLIAC-exploitable parallelism. Detection of ILLIAC-exploitable
parallelism involves detecting individual statements within DO
loops which may be executed in parallel for values of the DO indices,
and determining a statement ordering which permits parallel execution
without altering data dependencies. Chapter II, which examines
parallelism detection, is primarily concerned with data dependency and
statement ordering. Appendix A provides a technique for detecting data
dependencies  in  an  arbitrarily  complex  flow  graph. 

By means of the parallelism algorithm, the flow graph is appropriately
transformed. The parsed input stream is replaced by N-address macros
which permit symbolic references (i.e., A(I,J,K-2) is a legal address).

At this point it is feasible to examine storage requirements with
respect to I/O demands. Chapter III is concerned with optimizing I/O for
large arrays. If this procedure dictates a new statement ordering, the
appropriate transformations are effected. The examination of FORTRAN
statement orderings provided an insight into the nature of partial orderings.
These observations are contained in Appendix B.

After reordering in order to minimize I/O latency, array reference
macros are expanded according to the array methodology described in the
First Semi-Annual Report. The resulting pseudo code may be further optimized
to take advantage of CU-PE execution overlap. Chapter IV defines
ILLIAC optimization goals with respect to overlap, and provides a method
for allocating instructions between CU and a single PE for sequential code.
Appendix C specifies the upper boundary for execution savings in an overlap
optimization effort.

At this stage it is appropriate to generate ILLIAC IV code.


II. THE DETECTION OF PARALLELISM IN DO LOOPS


This section consists of two distinct parts. The first of these
represents an overview of the nature and aims of the procedure for
detecting parallelism in DO loops exploitable on the ILLIAC IV. The
second is concerned with a particular aspect of this procedure: it is
a discussion of the problem of data dependency detection in certain
restricted but basic situations involving nested DO loops and
multi-dimensioned arrays.

In  the  discussion  of  data  dependency  detection  in  both  the  present 
and  previous  semi-annual  reports,  relatively  simple  flow  logic  has  been 
assumed  for  clarity  of  explication.  In  actual  practice,  of  course,  such 
simplicity is not likely to be the rule. Hence, included in Appendix A is
the  basis  of  a  general  algorithm,  based  largely  on  p-graph  theory,  for 
detecting  data  dependency  in  loops  with  any  degree  of  control  complexity. 


PART 1: AN OVERVIEW

The  goal  of  the  parallelism  procedure  is  the  examination  of 
FORTRAN  DO  loops  in  order  to  determine  how  much  of  the  calculation, 
if any, can be done in parallel on the DO index variable. Manipulation
of code will be performed whenever possible to maximize the
amount of parallel computation. The analysis to accomplish the above
will necessarily be restricted as to the complexity of code it can deal
with; that is, in some cases (hopefully a small percentage) no attempt
will be made to manipulate the code to effect parallelism.

RESTRICTIONS 

1.  The  analysis  of  DO  loops  will  extend  to  detection  of  parallelism 
on  the  index  variable  along  2  dimensions  in  a  DO  nest.  In  a  DO  loop 
nested  3  or  more  levels  deep,  code  is  analyzed  for  parallelism  in 
the  innermost  DO  variable,  then  in  the  next  nesting  DO  loop  variable, 
etc., until parallelism along 2 dimensions has been found. Along all
other dimensions, sequential execution is assumed.


Example:

DO I
   A(I) =
   DO K
      C(I,K) =
      DO L
         D(I,K,L) =


The L loop is the only one nested 3 or more deep. The code in
this loop is analyzed first with respect to the L variable, then K,
then I. If parallelism is found along the L and K dimensions, say,
then with respect to the I dimension the code will be executed
sequentially. Outside the L loop, code is examined for parallelism along
1 or 2 dimensions, as the case may be.


2. At present, internal (non-DO) cycles in a DO loop are considered
as a special case: a cycle will be examined for parallelism (in the
DO variable) as it stands; no attempt will be made to manipulate
the code to make parallelism possible. In relation to the rest of
the code in the loop, a cycle (or nest of cycles) will be treated as
a "black box" containing only a list of definitions and references.

3. Subscripted variables. The procedure is based heavily on the
analysis and ordering of the subscripts of array variables, primarily
for the determination of inter-loop data dependencies. Only cases
having subscripts of the standard linear form kI + c, where I is the
DO variable, will be fully analyzed. Otherwise, it will be assumed
that nothing is known about the relative value of a subscript and the
"worst possible case" is assumed. This in general will force sequential
execution of the affected code. A possible exception to the above
restriction (since it appears to occur commonly) is a subscript which is
a linear form in a non-DO variable which is, however, easily detected
to be a linear form in the DO variable.


4. So far as the analysis described here is concerned, a case of
"detected parallelism" is a piece of code for which it has been
determined that the associated data dependencies are such that each
statement can be executed simultaneously for all values of the index
variable. No representation is made, however, that the data for
the computations can be presented simultaneously to the processing
elements. That is, the analysis looks for concurrency of operator
execution, but not operand fetching. In the statements:

A(I) = B(I) + C(I*)

Z(I) = Y(I) + FUNCT(I)

it may be possible to perform the addition and store operations in
parallel (on I). However, the fetching of elements of C in the first
statement may require a complex sequence of instructions. In the
second statement, calculation of the function FUNCT might require
all processors; the values FUNCT(I) would then have to be generated
sequentially.
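
The subscript test of restriction 3 can be illustrated with a minimal sketch. It is written in Python for illustration only; the report gives no implementation, and the function name `linear_form` and the representation of subscripts as text strings are assumptions. The sketch accepts subscripts of the standard linear form kI + c in the given DO variable and returns `None` for anything else, which corresponds to assuming the worst possible case.

```python
import re

# Hypothetical pattern for "k*VAR + c", with k and c optional integer literals.
_LINEAR = re.compile(
    r"^\s*(?:(\d+)\s*\*\s*)?([A-Z]\w*)\s*(?:([+-])\s*(\d+))?\s*$"
)

def linear_form(subscript, do_var):
    """Return (k, c) if subscript has the form k*do_var + c, else None."""
    match = _LINEAR.match(subscript)
    if match is None or match.group(2) != do_var:
        return None                      # indeterminate: assume the worst case
    k = int(match.group(1)) if match.group(1) else 1
    c = int(match.group(4)) if match.group(4) else 0
    if match.group(3) == "-":
        c = -c
    return k, c
```

The possible exception mentioned in restriction 3 (a non-DO variable that is itself linear in the DO variable) is deliberately not handled by this sketch.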

PARALLELISM PROCEDURE

Sections:

1. Determination of data dependencies

2. Stating the ordering relations for parallel execution

      Causal chains
      Branches and merges
      Cycles

3. Determination of optimal total ordering to minimize overwrite

1. Determination of Data Dependencies

Assume  a  technique  equivalent  to  the  p-graph  algorithm  to  be 
applied to the loop code. For input to the algorithm, a "p-graph" for
each variable (or uniquely subscripted variable) is represented; the
nodes of the graph correspond generally to the uses and definitions of
variables; additional nodes for merge points and the entry and exit
points are supplied. Application of the algorithm gives all data
dependencies for non-subscripted variables, that is, a variable use is
explicitly  related  to  one  or  more  "circled"  nodes  which  might  have 
generated  its  present  value.  (It  is  assumed  that  the  algorithm  keeps 
track  of  all  circled  nodes  associated  with  a  merge  node.) 


For subscripted variables, the algorithm gives all intra-loop
dependencies; the determination of inter-loop dependencies is
more complex. If the graph for a subscripted variable shows a use
of its initial value (either directly or via a merge), then it must be
determined if this value could have been generated in a previous
iteration of the loop. To do this, the graphs for the same array variable
are assigned indices according to descending value of their literal
subscripts*. For example, if array variable A appears in a loop on index I
with subscripts I-1, I, I+1, then the A(I+1) graph is assigned 1, the A(I) graph
2, etc. To find a possible generation in a previous loop iteration for a
subscripted variable use with index n, the graphs corresponding to
indices n-1, n-2, etc., are examined in turn. The first graph encountered
having a non-initial value at its exit node gives an inter-loop dependency,
i.e., the value was generated at the node(s) generating the exit value.
If the node is a merge node with an initial value as input, the search
for inter-loop dependencies is continued, halting finally when only
non-initial values are encountered, or when all the p-graphs have been
examined. It should be noted that this technique is made feasible by the
observation that although the number of literal subscript expressions
appearing in the source text within the range of the loop is in principle
unlimited, it is in fact usually rather small.
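
The search over indexed p-graphs can be sketched as follows. This is a simplified illustration in Python, not the report's algorithm: each p-graph is reduced to a single entry saying which statement (if any) generates its exit value, merge nodes are ignored, and all subscripts are assumed to have the form I+c. The name `previous_iteration_source` and the dictionary layout are assumptions.

```python
def previous_iteration_source(exit_definitions, use_offset):
    """Find the statement that may have generated, in an earlier iteration,
    the value read by a use of A(I + use_offset).

    exit_definitions maps each literal offset c (for A(I+c)) to the number of
    the statement generating the exit value of that graph, or to None when
    the exit value is still the initial value.
    Returns (statement, iterations_back) or None.
    """
    # Index 1 goes to the largest literal subscript, index 2 to the next, ...
    ordered = sorted(exit_definitions, reverse=True)
    n = ordered.index(use_offset)
    # Examine the graphs with indices n-1, n-2, ... in turn.
    for position in range(n - 1, -1, -1):
        offset = ordered[position]
        statement = exit_definitions[offset]
        if statement is not None:
            # A(I'+offset) written at iteration I' is read at I' + (offset - use_offset).
            return statement, offset - use_offset
    return None
```

For a loop that defines A(I) in statement 2 and uses A(I-1), the call `previous_iteration_source({1: None, 0: 2, -1: None}, -1)` reports statement 2, one iteration back.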

Subscripting Non-Subscripted Variables

Non-subscripted variables will have to be subscripted in some cases in
order to execute statements in parallel:

DO I

1 C = A(I)/D

2 B(I) = C * FUNCT(C)

In this example, C would have to be replaced by a vector, say C(I),
in order to execute all definitions of C in parallel in statement 1.
Both references to C in statement 2 would of course also be replaced
by C(I). The variable D, on the other hand, need only be "broadcast"
simultaneously to all processors since its value is the same for all I.

*Only subscripts of standard form kI+c will be ordered. Other subscripts
are considered "indeterminate", i.e., possibly having any index value.

In general, a non-subscripted variable (and its dependencies) will be
subscripted when it is defined in an assignment statement whose right
hand side is (directly or indirectly) a function of the DO variable.

The subscripting will take two forms. For intra-loop dependencies,
as in the above example, the subscript assigned is identical
for the definition and all its references. For inter-loop dependencies,
that is, where the value for a use of the variable was generated
in the previous iteration, for example:

DO I

1 B(I) = C * FUNCT(C)
2 C = A(I+1)/D

the use will be assigned a subscript value 1 less than its definition.
The above example may be reformulated:

DO I

1 B(I) = C(I-1) * FUNCT(C(I-1))

2 C(I) = A(I+1)/D

Assuming the first element of the vector C initialized to the value of C
before entry into the loop, the above formulation is equivalent. Of
course, for parallel execution of this loop, statements 1 and 2 must be
reversed (see next section).
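
The equivalence claimed for the reformulated loop can be checked with a small simulation. The Python below is only a sketch: `funct` is a hypothetical stand-in for FUNCT, lists are indexed so that `a[i]` plays the role of A(I), and `cv[0]` holds the value of C before entry into the loop. The second function executes statement 2 for all I before statement 1, which is the reversed order required for parallel execution.

```python
def funct(x):
    """Hypothetical stand-in for the FORTRAN function FUNCT."""
    return x + 1.0

def original_loop(a, d, c0, n):
    """Sequential loop with the scalar C carried between iterations."""
    b = [0.0] * (n + 1)
    c = c0
    for i in range(1, n + 1):
        b[i] = c * funct(c)          # statement 1
        c = a[i + 1] / d             # statement 2
    return b[1:]

def expanded_loop(a, d, c0, n):
    """C replaced by a vector; each statement runs for all I at once."""
    cv = [c0] + [0.0] * n            # cv[0] is the value of C on entry
    for i in range(1, n + 1):        # statement 2 for every I
        cv[i] = a[i + 1] / d
    # Statement 1 for every I, using the value from the previous iteration.
    return [cv[i - 1] * funct(cv[i - 1]) for i in range(1, n + 1)]
```

With `a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]`, `d = 2.0`, `c0 = 7.0` and `n = 4`, both functions return the same list of B values.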


2. Ordering Relations for Parallel Execution


The condition for parallel execution is that in inter-loop
dependencies, all value generations relevant to a variable use
precede that use. To find out if it is possible to rearrange the code
in the loop to meet this condition and at the same time preserve the
essential data dependencies, a set of precedence relations might
be constructed as follows. Assume that p-graph nodes have been
numbered so that corresponding nodes on different graphs have the
same number. Represent each data dependency by the relation
n1 < n2, where n2 is the node number of a variable use and n1 the
node where its value was generated. (A relation for each possible
generation, if more than one, must be stated.) If the total set of
relations is examined and found consistent, i.e., if there are no
cycles, then the DO loop can be made to be executed in parallel.
There  are  easy  techniques  applicable  to  Boolean  precedence  matrices 
for  detecting  cycles  and  for  determining  total  orderings  from  the  given 
partial  ordering  [1,2,3]. 
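
As an illustration of the consistency check, the sketch below derives a total ordering from a set of precedence relations, or reports that none exists because the relations contain a cycle. It uses a standard topological sort (Kahn's algorithm) rather than the Boolean precedence matrix techniques of the cited references, and the function name `total_order` is an assumption made for this example.

```python
from collections import defaultdict

def total_order(nodes, relations):
    """Return one total ordering consistent with the relations, or None.

    relations is an iterable of (n1, n2) pairs meaning n1 must precede n2.
    None is returned when the relations are inconsistent (contain a cycle).
    """
    successors = defaultdict(set)
    indegree = {n: 0 for n in nodes}
    for before, after in relations:
        if after not in successors[before]:
            successors[before].add(after)
            indegree[after] += 1
    ready = sorted(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
        ready.sort()                 # smallest statement number first
    return order if len(order) == len(nodes) else None
```

For instance, the single relation 2 < 1 gives the ordering [2, 1], while the pair 1 < 2, 2 < 1 is inconsistent and the function returns `None`.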


If  cycles  are  found  in  the  ordering,  then  there  are  dependencies 
which  represent  real  causal  chains  in  the  DO  loop,  for  example: 

A(I) = A(I-1)

which  must  be  executed  sequentially.  The  following  examples  illustrate 
another  causal  chain  and  a  similar  case  which  is  not  a  chain: 


DO I

1 X(I) = A(I-1)          X: 1 < 2
2 A(I) = X(I)*2          A: 2 < 1


DO I

1 X(I) = A(I-1)          A: 2 < 1
2 A(I) = Y(I)            X, Y: no dependencies


In  general,  only  maximal  cycles  will  be  considered:  the  sequence 
of operations represented by each cycle will have to be executed
sequentially (in the original order). However, all other operations in
the DO loop (if any) can be performed in parallel. A total ordering can
be  determined  by  considering  each  cycle  a  single  node  and  restating  the 
ordering  restrictions  accordingly. 
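
One way to find the maximal cycles so that each can be treated as a single node is sketched below. The Python forms the transitive closure of the precedence relation with Warshall's algorithm on a Boolean matrix and groups statements that can reach each other; this is an illustrative choice, not necessarily the technique intended by the report, and the name `maximal_cycles` is an assumption.

```python
def maximal_cycles(nodes, relations):
    """Group nodes into maximal cycles; nodes on no cycle form singletons.

    relations is an iterable of (n1, n2) pairs meaning n1 must precede n2.
    Returns a list of tuples, in order of first appearance in nodes.
    """
    index = {n: k for k, n in enumerate(nodes)}
    size = len(nodes)
    reach = [[False] * size for _ in range(size)]
    for before, after in relations:
        reach[index[before]][index[after]] = True
    for k in range(size):                       # Warshall's closure
        for i in range(size):
            if reach[i][k]:
                for j in range(size):
                    if reach[k][j]:
                        reach[i][j] = True
    groups, seen = [], set()
    for n in nodes:
        if n in seen:
            continue
        group = tuple(
            m for m in nodes
            if m == n or (reach[index[n]][index[m]] and reach[index[m]][index[n]])
        )
        seen.update(group)
        groups.append(group)
    return groups
```

With the relations 1 < 2 and 2 < 1 plus a hypothetical third statement constrained by 3 < 1, the result is `[(1, 2), (3,)]`: statements 1 and 2 form one sequential unit, and statement 3 remains free to execute in parallel.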


Branches  and  Merges 

Rearrangement  of  code  must  necessarily  take  into  account  the  flow 
logic  of  the  loop.  If  the  loop  contains  no  cycles  -  only  branches 
and  merges  -  the  problem  is  simple.  Assume  that  in  general  all 
operations  are  associated  with  "mode  sets",  that  is,  with  data 
words  set  to  indicate  the  values  of  the  DO  variable  over  which  an 
operation  is  to  be  defined.  (If  the  loop  contained  only  "straight 
line" code, all mode sets would conceptually be set to the entire
DO index range.) Assume that at execution of an IF statement all
relevant mode sets are set appropriately. Then it is only necessary
to add to the data dependency ordering relations the conditions
that setting of mode sets precede the operations dependent on them.


Example:

DO I

1 B(I) = A(I-1) + 1

2 IF (I > 5)

     (I > 5)                (I ≤ 5)

3 B(I) = B(I) - 1       4 A(I) = A(I) + 1

Assume statement 2 sets mode set 1 for I > 5 and mode set 2 for I ≤ 5.
Statement 3 is associated with mode set 1, statement 4 with mode set 2.
The  ordering  relations  are: 


A:          4 < 1

B:          1 < 3

Mode sets:  2 < 3
            2 < 4


This  gives  the  total  ordering: 

2  <  4  <  1  <  3 


The  loop  is  therefore  executable  in  parallel  as  follows: 


Statement                          Mode Sets

2 IF (I > 5)                       All I
     Sets mode sets 1 and 2
4 A(I) = A(I) + 1                  I in mode set 2
1 B(I) = A(I-1) + 1                All I
3 B(I) = B(I) - 1                  I in mode set 1

Example: 


DO I

1 C(I) = A(I-1)

2 IF (I > 5)

     (I > 5)                (I ≤ 5)

3 A(I) = D(I) + 1       4 A(I) = C(I) + 1


Assume  mode  sets  as  before.  The  ordering  relations  are: 


A:          4 < 1
            3 < 1

C:          1 < 4

Mode sets:  2 < 3
            2 < 4

Total  ordering: 


2  <  3  <  cycle  (1,4) 


The cycle forces sequential execution of the pair (1,4), but 2 and
3 can be executed in parallel:

Statement                          Mode Sets

2 IF (I > 5)                       All I
     Sets mode sets 1, 2
3 A(I) = D(I) + 1                  I in mode set 1
Sequential loop:
   1 C(I) = A(I-1)                 All I
   4 A(I) = C(I) + 1               I in mode set 2


Cycles 

After  data  dependencies  have  been  determined,  internal  (nests  of) 
cycles are examined. If there is found to be any inter-loop
dependency within a cycle, the loop cannot be executed in parallel (on the
DO variable) as it stands:

DO 10 I

J = 1

1 A(I) = A(I) + 1

B(I) = B(I) + A(I-1)

J = J + 1

IF (J < I) GO TO 1
10 CONTINUE


In this example, A(I-1) is dependent on the final definition of
A(I), that is, the value at completion of the internal cycle. In
this case, the cycle could be split to permit parallel execution:

DO 10 I

J = 1

1 A(I) = A(I) + 1

J = J + 1

IF (J < I) GO TO 1
J = 1

1x B(I) = B(I) + A(I-1)

J = J + 1

IF (J < I) GO TO 1x
10 CONTINUE


In  the  following  example  no  such  manipulation  is  possible  because  of 
the  B(I)  dependency  in  the  IF  statement: 

DO 10 I
1 A(I) = A(I) + 1
B(I) = B(I) + A(I-1)

IF (B(I) < C) GO TO 1
10 CONTINUE


At  present,  the  proposed  procedure  simply  declares  a  cycle  executable 
in  parallel  on  the  DO  variable  or  not,  depending  on  the  absence  or 
presence  of  inter-loop  dependencies.  Some  investigation  of  "cycle 
splitting"  for  the  general  case  shows  the  analysis  to  be  more  complex 
than the first example suggests.




A cycle will be represented in the ordering relations by a single
node at which all variable uses dependent on values generated
before entry to the cycle and all variable definitions occurring
in the cycle are associated. The latter represent the final
values of variables on exit from the cycle. It may be that examination
of all the ordering relations shows a cycle node to be in a
causal chain sequence. In this case, the cycle is always executed
sequentially on the DO variable in its place within the sequence.


3. The Optimal Total Ordering: Overwrite Considerations

Thus  far,  the  requirements  laid  down  for  ordering  a  DO 
loop  to  permit  parallel  execution  have  excluded  consideration  of 
overwrite,  that  is,  the  redefinition  of  a  variable  or  array  occurring 
before  all  uses  of  the  previous  definition  have  taken  place. 
Preventing  overwrite  is  considered  a  secondary  requirement  in 
ordering  the  loop  because,  if  need  be,  it  can  be  handled  by  using 
temporary  storage.  However,  this  incurs  a  cost  in  space  and,  in 
some cases, also in time*.

The  ordering  relations  discussed  in  the  last  section  define 
a  partial  ordering  of  the  operations  in  a  DO  loop  which  implies  some 
set  of  total  orderings.  The  problem  then  is  to  choose  the  ordering 
that minimizes overwrite (according to some criteria). In cases
where there are relatively few orderings to choose from, any ad hoc
solution will probably suffice. The general case, however, may
involve infinite combinatorics and apparently no general solution
to this problem has been found. Heuristic solutions will be
investigated based on experience with the I/O latency problem.


*  Extra  code  may  be  needed  to  restore  values  to  permanent  arrays  on 
exit  from  the  DO  loop.  It  may  also  be  needed  when  overwrite  results 
from  re-ordering  the  original  definitions  in  a  loop,  causing  incorrect 
exit values.


PART  2:  DATA  DEPENDENCY  IN  NESTED  DO  LOOPS 


With regard to the general problem of extending the present methods
of analysis to comprehend cases involving multi-dimensioned arrays and
nested DO loops: Consider a restricted situation in which we have two
tightly nested loops ("tightly" meaning here that all code contained within
the outer loop is contained within the inner), the outer on I and the inner
on J and both using an increment value of +1, and references to a
two-dimensioned array A, in all of which the first index is of the form I+c,
the second of the form J+k, where c and k are integers. Assume that
there are no control transfer statements.

For any given statement in the loops containing a right-side reference
to A, say statement x, we wish to determine whether there are any ordering
constraints on x for its SIM execution on 1) J, 2) I, and 3) I and J together.
More precisely, we are interested in the ordering of statements necessary in
transforming the nest of simple DO loops into any of the following three nests:

1) DO SEQ I/DO SIM J, 2) DO SIM I/DO SEQ J, and 3) DO SIM I/DO SIM J.

(For the present, we shall not be concerned with overwrite problems --
"ordering constraints" in this context will refer simply to those relationships
between statements necessary in order that the values of an array be generated
before they are used.) The following discussion, for the present purpose
of clarity, makes no use of the p-graph concept and terminology.

The search for ordering constraints involves, as before, examining the
rest of the loop code for dependency relations, to see if the values required
for the reference to A in x are generated elsewhere in the nest. But whereas
in the case of single loops there were essentially only two kinds of data
dependency relations, referred to as "intra-loop" and "inter-loop" dependencies,
which were of more or less equal significance in transforming the code to permit
exploitation of ILLIAC-type parallelism, a nest of two loops introduces a
great deal more complexity. It is no longer true, for example, that a data
dependency relationship necessarily implies an ordering constraint.

Specifically, there are three distinct kinds of data dependency relations
possible,  one  of  which  has  three  different  varieties;  each  of  this  total 
of  five  types  has  slightly  different  implications  for  the  transformation  of 
the  code;  and  one  reference  to  an  array  can  be  dependent  on  any  number  of 
other  statements  in  various  of  these  ways . 


The most straightforward situation is one involving what we shall call
simply an "intra-loop" dependency relation, which is precisely analogous
to the like-named relation in single-loop code. For example, suppose that
the following statements occur in a nested loop of the sort under
consideration:

y A(I,J) = B(I,I) + C(J)

x D(I,J) = A(I,J) + 1

Assuming that no other statement with a left-side term in A intervenes,
every value used by the reference to A in statement x is generated by
statement y during the same iteration of the loop code. In such a situation,
all that is necessary (so far as this particular reference to A is concerned)
for the DO nest to be transformed into any of the three nests described
above is that statement y precede statement x.

A  slightly  more  complicated  but  still  fairly  straightforward  situation  is 
one  involving  what  we  shall  call  an  "intra/inter-loop"  dependency  relation, 
that  is,  where  values  are  generated  and  used  within  a  single  iteration  of 
the  outer  loop  but  in  different  iterations  of  the  inner  loop;  for  example: 

y A(I,J+1) = B(I,I) + C(J)
x D(I,J) = A(I,J)

When, say, I=1 and J=1, statement y generates a value for A(1,2); when J
is incremented and the code executed again, statement x uses this value.
More generally, if we represent the reference to A in statement x by A(f1, f2),
then this kind of dependency relation can occur only if there is at least
one statement y with left-side term A(g1, g2) such that g1 = f1 -- otherwise
there could be no interaction between the two for a single value of I -- and
g2 > f2 -- otherwise x would use a value for any particular element of A before
y could generate one. If there is more than one such reference to A, the
particular statement generating the values used in x can be determined by
examining those references with respect to the second indices alone by a
procedure essentially identical to that described in an earlier report for
singly-dimensioned arrays in single loops. The statement, say y, thus
located must precede x in any transformation of the DO nest to the first or
third type of the SIM nests listed above; however, for the second type,
involving SIM execution on I alone, this dependency relation requires no
ordering constraints, since the sequential execution with respect to J will
automatically ensure that generation precede use.
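
The two dependency kinds described so far, and the nest types in which each imposes an ordering constraint, can be summarized in a short sketch. The Python below is illustrative only: offsets are given as integer pairs (c, k) for a term A(I+c, J+k), the function names are assumptions, and the label "outer-loop carried" is a placeholder of this sketch for dependencies that cross iterations of the outer loop, which the sketch does not analyze.

```python
def classify(definition, use):
    """Classify the relation between a left-side term A(I+g1, J+g2) and a
    right-side reference A(I+f1, J+f2), each given as an offset pair.
    For the intra-loop case, the definition is assumed to come earlier
    in the loop body than the use."""
    (g1, g2), (f1, f2) = definition, use
    if (g1, g2) == (f1, f2):
        return "intra-loop"            # same iteration of both loops
    if g1 == f1 and g2 > f2:
        return "intra/inter-loop"      # same I iteration, earlier J iteration
    if g1 > f1:
        return "outer-loop carried"    # placeholder label, not analyzed here
    return None                        # the definition cannot feed this use

def nests_requiring_order(kind):
    """Nest types in which the definition must precede the use:
    1 = SEQ I / SIM J, 2 = SIM I / SEQ J, 3 = SIM I / SIM J."""
    if kind == "intra-loop":
        return {1, 2, 3}
    if kind == "intra/inter-loop":
        return {1, 3}                  # sequential J already orders type 2
    return None                        # outside the scope of this sketch
```

For the second pair of statements above, `classify((0, 1), (0, 0))` gives "intra/inter-loop", and the ordering constraint applies to nest types 1 and 3 only.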


The situations considered thus far are not in principle different
from those encountered in the discussion of singly-dimensioned arrays
in single loops. At this point in the analysis, however, the consequences
of the nestedness begin to make themselves felt. Suppose, for example,
that a reference A(f1, f2) in statement x was found to use values generated
by statement y with left-side term A(g1=f1, g2) in an earlier iteration of the
inner loop but during the same iteration of the outer loop -- that is, an
intra/inter-loop dependency exists between x and y. As in the case of
single loops, the fact that the dependency is inter-loop with respect to J
implies that during at least one iteration of the J loop (specifically, during
the first g2-f2 iterations) statement x will use an "initial value" of A; but whereas
"initial" in the former case meant that the value was generated prior to
entry into the loop, here it means simply that the value was generated
before the present execution of the J loop was initiated -- it may or may not
have been generated by a statement other than y, during an earlier execution
of the J loop, that is, during a previous iteration of the I loop.

Even if an intra/inter-loop dependency is discovered, then, the search
for generating statements must continue. (This, of course, is obviously
not the case for a simple intra-loop dependency.) Only statements containing
left-side instances of A with first indices larger than that of the reference
to A in x are candidates; those with smaller first indices clearly could not
generate values used in x in earlier iterations of the I loop. Disregarding
for the moment the conditions imposed by the existence of a finite test
value for the DO loops, it should be clear that all of these definition
statements will generate values for some of the elements of the array
referenced in x prior to that reference -- what must be determined is which
statement is the last to do so for any given element.

If two left-side terms have different first indices, the one with the
larger index will generate a value for a particular element of A during an
earlier iteration of the outer loop than the other; if two terms have identical
first indices and different second indices, the one with the larger second
index will generate such a value during the same iteration of the outer loop
but during an earlier iteration of the inner loop; and finally, if both indices
are identical, the original ordering of the statements determines the priority
of value generation. Tentatively, then, the statement we seek would be a
member of the set of the candidate statements with the lowest first index in
the left-side term, and, of these, the one with the smallest second index, and,
if there are more than one of those, the one occurring latest in the loop code.
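
The tentative selection rule can be written down directly. The following Python is a sketch of that rule only, before the further restrictions discussed next are applied; the tuple layout (position in the loop code, first-index offset, second-index offset) and the name `select_generator` are assumptions made for illustration.

```python
def select_generator(candidates, f1):
    """Pick the definition of A that is last to generate a given element.

    candidates is a list of (position, g1, g2) tuples, one per statement
    with a left-side term A(I+g1, J+g2); position is its place in the loop
    code. f1 is the first-index offset of the reference in statement x.
    Returns the selected tuple, or None if no statement is a candidate.
    """
    # Only definitions with a larger first index can feed x from an
    # earlier iteration of the outer loop.
    eligible = [c for c in candidates if c[1] > f1]
    if not eligible:
        return None
    # Lowest first index, then smallest second index, then latest in the code.
    return min(eligible, key=lambda c: (c[1], c[2], -c[0]))
```

Given candidates `[(1, 2, 0), (2, 1, 3), (3, 1, 1), (4, 1, 1), (5, 0, 0)]` and `f1 = 0`, the rule selects `(4, 1, 1)`: first index 1 is the lowest eligible, second index 1 is the smallest among those, and statement 4 is the later of the two that tie.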

If an intra/inter-loop dependency had previously been discovered,
involving a definition A(g1=f1, g2) in statement y, then of course a further
restriction is that the second index be smaller than g2 -- otherwise statement
y overwrites all the values before they can be used in x. If the statement
selected by the procedure just described fails to meet this requirement,
all statements with left-side instances of A with the same first index
are excluded from consideration (since they too would necessarily be
overwritten by y) and the search begun again with the next greater first
index. Suppose that, eventually, statement z, with left-side term A(h1, h2),
is selected. If h2 ≤ f2, then z will generate values for all those references to A in x
(after the first (h1-f1) iterations of the outer loop) for which y fails to
generate values. If, however, h2 > f2, then there will remain at least one
iteration of the J loop (to be precise, (h2-f2) iterations, again after the
first (h1-f1) iterations of the outer loop) for which x will use an "initial
value" for the A reference. Consequently, the search must be continued.

Clearly any statements with a left-side reference to A with a first index
equal to h1 are excluded; further, there is now a restriction on the second
index, namely that it be smaller than h2. The search continues in this
manner and terminates in one of two ways: either no A definition statements
remain for consideration, or sufficient generating statements have been
found to produce the greatest possible number of values for the reference to
A in x.

The preceding discussion, however, omitted any consideration of the
consequences of particular values for the DO loop parameters. The procedure
described above must be modified in certain ways to take these into account.
Consider, to begin with, the initial and test values of the outer loop;
call them α1 and β1. The value R1 = β1 - α1 + 1 represents the number of
iterations of the outer loop that occur during a single execution. If the
difference between the first index of a left-side instance of A and f1 is
greater than or equal to R1, then any interaction between the two references
is precluded; the search described above, then, ceases when such statements
are the only ones remaining to be examined.

Now consider the corresponding parameters in the inner loop, α2, β2,
and R2 = β2 - α2 + 1. These have an analogous effect; no statement with a
left-side reference A(p,q) can generate values used in x if (q - f2) ≥ R2
(in both inter- and intra/inter-loop dependencies). Additionally, no
statement need be considered if (f2 - q) ≥ R2. The former case, of course,
occurs when q is too much larger than f2, the latter when it is too much
smaller. These restrictions may be restated as limits on the value of q:
it must be smaller than an upper limit L_U = R2 + f2 and larger than a lower
limit L_L = f2 - R2. These limits are modified in the course of the search by
the discovery of generating statements. Suppose a statement y with left-side
term A(g1, g2) is determined to generate values used in x. If g2 = f2,
y will generate values for all references to A in x after the first (g1 - f1)
iterations of the outer loop; and since no statement remaining to be
examined can have a smaller first index, the search is terminated. If g2 > f2
(as it must be in an intra/inter-loop dependency, and may be otherwise),
then L_U is set to g2, since any remaining statement with a second index
higher than g2 will fail to generate any values that could be used by the
reference to A in x other than ones that will be overwritten by y. If g2 < f2,
L_L is set to g2 for similar reasons.

These limiting values provide one of the direct tests for terminating
the search for generating statements: if (L_U - L_L) ≤ R2, there exist no
statements remaining to be examined that could generate values used by the
reference to A in x. (The satisfaction of this condition means, loosely
speaking, that two generating statements have been found, one with a
left-term second index smaller than f2, the other with a left-term second
index larger than f2, that are "close" enough to "overlap"; that is, J takes
no value in the loop that is not taken by the second index of one or the other
of the two generating statements.)

An additional consequence of the finiteness of R2 that might be noted
here is that several generating statements may have identical first indices;
for example:

     y     A(I+1,J)   = B(I,J) + C(I)

     z     A(I+1,J+2) = D(I,J) + C(I)

     x     E(I,J)     = A(I,J+4) + 1

If J ranges from 1 to 10, then y will generate values for use in x from
A(I,5) to A(I,10), and z for A(I,11) and A(I,12), for all I except α1.

For any given first index, however, there is at most one generating
statement with a second index equal to or greater than f2.
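
The claim made for this example can be checked by direct simulation. The following Python sketch is an illustration added here, not part of the procedure described in the text. It assumes an outer range of I = 1 to 3 (chosen arbitrarily) and the reading of the three statements given in the comments. The function name `last_writers` is hypothetical.

```python
def last_writers(i_range=(1, 3), j_range=(1, 10)):
    """Simulate the nest sequentially and record, for each execution of x,
    which statement last wrote the element of A that x reads."""
    last = {}   # (first index, second index) -> label of last writing statement
    used = {}   # (I, J) -> label of the statement whose value x reads
    for i in range(i_range[0], i_range[1] + 1):
        for j in range(j_range[0], j_range[1] + 1):
            last[(i + 1, j)] = "y"        # y: A(I+1,J)   = B(I,J) + C(I)
            last[(i + 1, j + 2)] = "z"    # z: A(I+1,J+2) = D(I,J) + C(I)
            # x: E(I,J) = A(I,J+4) + 1
            used[(i, j)] = last.get((i, j + 4), "initial")
    return used


used = last_writers()
for j in range(1, 11):
    # second column: first outer iteration; third column: a later iteration
    print(j, used[(1, j)], used[(2, j)])
```

Under these assumptions the first outer iteration reads only initial values, while later iterations read values from y for J = 1 to 6 (elements A(I,5) to A(I,10)), from z for J = 7 and 8 (A(I,11) and A(I,12)), and initial values for the remaining two iterations.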

All of the cases discussed above where x is found to depend on a
statement with a larger first index may be thought of as simply "inter-loop"
dependencies. The three possible varieties of inter-loop dependency,
however, have rather different implications for the reordering of
statements in effecting a transformation of the DO loops into a DO SIM nest.

In no case is there any ordering constraint for the transformation into the
first type of nest, involving SIM execution on J alone, since sequential
execution with respect to I will ensure that generation precedes use. In
every case the generating statement must precede x for the third type of
nest, the SIM/SIM nest. It is for the second type of nest, involving SIM
execution on I alone, that the consequences differ: if the second index is
equal to f2, then the generating statement must precede x; if it is greater
than f2 there is no constraint; while if it is less than f2, SIM execution on
I alone cannot be effected.

The following flowchart represents a precise statement of the procedure
described in the preceding pages. It is assumed that all N left-side
instances of A in the body of the loop have been located, and that they have
been listed in a table, along with associated statement numbers (which
have been assigned sequentially in the original order), in order of increasing
indices, the second index varying more rapidly, and, where pairs of indices
are identical, in decreasing order of statement number. In the flowchart,
the symbols c and k represent the first and second index constant modifiers
of the reference to A in x; the symbols s_m, p_m, and q_m represent,
respectively, the statement number and the first and second index constant
modifiers of the m-th entry in the table. (Note that since it is the
difference between indices that determines the decisions in searching for
data dependency relationships, it is only the constant modifiers that are
significant.)
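
The table ordering assumed by the flowchart can be expressed as a single sort key. The sketch below is illustrative only; the names `Entry` and `build_table` are hypothetical and the sample entries are invented for the example.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    s: int  # statement number, assigned sequentially in the original order
    p: int  # first index constant modifier of the left-side instance of A
    q: int  # second index constant modifier


def build_table(entries: List[Entry]) -> List[Entry]:
    """Order entries by increasing indices, the second index varying more
    rapidly; identical index pairs go in decreasing statement number."""
    return sorted(entries, key=lambda e: (e.p, e.q, -e.s))


table = build_table([Entry(1, 1, 0), Entry(2, 1, 2), Entry(3, 0, 4), Entry(4, 1, 0)])
for e in table:
    print(e.s, e.p, e.q)
```

With this ordering, the entry met first among those with equal index pairs is the one occurring latest in the loop code, which is the tie-breaking rule stated earlier for identical indices.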

An entry is initially examined on the first page of the flowchart; if it is
potentially involved in an intra-loop dependency, it is accepted or rejected
on that page; if it is potentially involved in an intra/inter-loop dependency,
it is tested on the second page; and if it is potentially involved in an
inter-loop dependency, it is tested on the third page. Hexagonal boxes contain
the results; "s P x" means that statement s must precede x in the
transformed code, for SIM execution on the loop variables listed at the
bottom of the box (where I+J refers to SIM execution on both variables
together).

The flowchart should not be taken to be more than a precise
summary of the material presented in the text. First of all, it is obviously
valid only for nested loops of the restricted sort described at the outset
of this section. Secondly, the final form of the algorithm for detecting
data dependency in nested loops will probably be closer in spirit to the one
described in Appendix A for single loops. Finally, the algorithm inherent
in this flowchart does not necessarily embody the basic strategy to be
pursued in searching for ILLIAC-exploitable parallelism. For example,
rather than simply determining all possible ordering constraints arising
from data dependency relationships, it might be better to search for only
the kinds of data dependency relationships relevant to each of the possible
transformations of the nest, considered sequentially in order of their
predetermined desirability (which is related, for example, to the ranges
of the DO variables), halting as soon as a particular transformation is
determined to be possible.


[Flowchart, first page. Recoverable box annotations:]

     reject entries with first index < c

     test first index for potential inter-loop dependency; if found,
     go to B for further testing

     test second index for potential intra/inter-loop dependency; if
     found, go to A for further testing

     both indices equal; test for intra-loop dependency; if found,
     note ordering constraint and exit


III. ILLIAC I/O OPTIMIZATION


The following section is concerned with minimizing I/O requests
for large arrays. Because of disk latency, this effort is essential
to effective use of the ILLIAC IV. The scope of the approach is limited
to cases where array references in a small number of contiguous
statements require more space than is available in core. Attempts are made
to re-order these statements such that like array references appear in
adjacent statements.

Results have been disappointing, and future effort will be directed
at leaving statement order intact and calling I/O as early as possible.

By  partitioning  arrays  according  to  which  program  nodes  reference  them 
(as  suggested  by  T.  C.  Lowe  [4]), and  then  reducing  the  graph  of  the  program 
until  partition  size  exceeds  core  size,  it  may  be  possible  to  locate 
essential  I/O  calls.  Once  it  is  determined  where  I/O  calls  are  essential 
and  what  arrays  these  calls  reference,  the  problem  of  relocating  them  is 
similar  to  removing  invariant  calculation  from  DO  loops,  a  technique  which 
already  exists  in  the  literature  [1] . 

I/O ORIENTED STATEMENT PERMUTER

I. Purpose: The ratio of disk-seek time to memory time is approximately
4 orders of magnitude on ILLIAC IV. For this reason it is advantageous
to issue I/O calls as early as possible, hopefully minimizing time spent
waiting for material from disk. It is possible (by rearranging statements
while preserving data dependencies) to maximize the amount of I/O which
can be backgrounded. The purpose of this effort has been to examine
procedures for accomplishing such a rearrangement that will entail as
little cost in combinatorics as possible. The problem may be stated
formally as: Given a partial order on a finite set and a cost function associated
with each linear order on the set, find the linear order of minimal cost that
is consistent with the partial order.
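
As a reference point for the formal statement, the problem can be solved by exhaustive search when the set is small. The Python sketch below is an illustration, not the routine described in this section. The function name `min_cost_order` is hypothetical, and the cost function used in the example (the number of adjacent statement pairs that reference different arrays) is an invented stand-in for an I/O cost.

```python
from itertools import permutations


def min_cost_order(n, precedes, cost):
    """Return (cost, order) for a cheapest linear order of statements
    0..n-1 in which a comes before b for every pair (a, b) in `precedes`."""
    best = None
    for order in permutations(range(n)):
        position = {s: k for k, s in enumerate(order)}
        if all(position[a] < position[b] for a, b in precedes):
            c = cost(order)
            if best is None or c < best[0]:
                best = (c, order)
    return best


# Hypothetical example: statement k references array arrays[k];
# statement 0 must precede statement 3.
arrays = ["A", "B", "A", "B"]


def switches(order):
    """Stand-in cost: adjacent statements that reference different arrays."""
    return sum(arrays[a] != arrays[b] for a, b in zip(order, order[1:]))


print(min_cost_order(4, {(0, 3)}, switches))
```

The search visits all n! orders, which is why the remainder of the section is concerned with generating only the consistent orders and with heuristics for larger sets.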

II. Scope: After the rather intensive work described below had been
conducted it became apparent that the scope of this effort is somewhat
more limited than had been anticipated. The specific technique described
deals with I/O problems of a very local nature, and is applicable in situations
which do not seem to arise with astonishing frequency. The effort did,
however, afford the investigator deep insight into the nature of FORTRAN
code and the possibilities for its deformation.


III. The Permutation Generator


This routine has two parts. Part I computes a partial order matrix,
using Warshall's algorithm [3] to obtain the transitive closure of a set
of relations of the form a .lt. b, where a and b are statement numbers.
Part II generates permutations of the statements consistent with the
matrix from Part I. In addition, Part II is loaded with switches and
heuristics in an effort to find as short a path as possible to the minimal
permutation. These heuristics will be described in section VI.

III.1 Part I: the matrix generator

The data dependencies are generated as follows:

(note: .p. is our partial order relation)

L(a) = variable appearing on left of statement a
R(a) = variables appearing on right of statement a

a .p. b iff:

1) (a < b) and (R(a) ∩ L(b) ≠ ∅), or

2) (a < b) and (L(a) ∩ R(b) ≠ ∅), or

3) (a < b) and (L(a) = L(b)) and ∃x {b < x < inf{y | L(y) = L(b),
   y > b}, R(x) ∩ L(b) ≠ ∅}.

Condition (3) is actually too strong, for it defines not a .p.
relation but an anti-contiguity relation. If (3) is met, then the restriction
on a is that it may not appear between b and any statement whose right
side references the "b" activation of variable L(b). In the test program,
this condition was overlooked entirely, but it has had little effect on the
results and in no way invalidates them.

After the data dependencies are computed, Warshall's algorithm is
applied to obtain the transitive closure: an upper triangular boolean
matrix whose (i,j)th entry reflects the truth value of the statement "i .p. j".
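
The matrix generator can be sketched as follows. This is an illustrative Python rendering, not the test program itself. It implements conditions (1) and (2) only (condition (3) is left out here, as it was in the test program), and the function names and the four-statement example are hypothetical.

```python
def dependency_matrix(stmts):
    """stmts[k] = (left variable, set of right variables) for statement k,
    in original order. Returns m with m[a][b] True when a must precede b
    by condition (1) or (2)."""
    n = len(stmts)
    m = [[False] * n for _ in range(n)]
    for a in range(n):
        left_a, right_a = stmts[a]
        for b in range(a + 1, n):
            left_b, right_b = stmts[b]
            # (1) a reads what b overwrites; (2) b reads what a defines
            if left_b in right_a or left_a in right_b:
                m[a][b] = True
    return m


def warshall(m):
    """Transitive closure of a boolean relation matrix (Warshall's algorithm)."""
    n = len(m)
    c = [row[:] for row in m]
    for k in range(n):
        for i in range(n):
            if c[i][k]:
                for j in range(n):
                    if c[k][j]:
                        c[i][j] = True
    return c


# Hypothetical example: A=B; C=A; D=C; E=F
stmts = [("A", {"B"}), ("C", {"A"}), ("D", {"C"}), ("E", {"F"})]
closure = warshall(dependency_matrix(stmts))
for row in closure:
    print([int(v) for v in row])
```

Because every direct relation has a < b, the closure is upper triangular, as stated above.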

III.2 The basic permutation generator

The  permutation  generator  is  driven  by  a  mask  matrix  and  auxiliary 
tables,  all  computed  from  the  closed  partial  order  matrix. 

III.2.1 The mask matrix (MSK1)

This is a boolean n x n matrix, where n is the number of statements
being permuted. Each column is a position in the final linear order.
The (i,j)th entry is 0 if a placement of statement i in position j is legal,
-1 otherwise. The initial setting of the matrix is determined by the NP
and NF tables below.

III.2.2 The auxiliary tables

1. IRES: an n-vector which contains the final linear order.

2. JROW, JCOL: n-vectors containing the number of zeroes in a
   given row (column) of MSK1.

3. NP, NF: n-vectors containing the number of statements which
   must precede (follow) a given statement. Row I of MSK1 originally
   contains NP(I) -1's, followed by n-NP(I)-NF(I) 0's, followed by NF(I)
   -1's. In words, a statement which must precede (follow) NF(I) statements
   (NP(I) statements) cannot appear in any of the last (first) NF(I) (NP(I))
   positions.
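
The initial setting of the mask matrix follows directly from NP and NF. The sketch below is illustrative only. It assumes the closed partial order matrix is given as a list of boolean rows, with positions numbered from 0. The function name `initial_mask` is hypothetical.

```python
def initial_mask(closure):
    """Build the initial mask matrix and the NP, NF tables from the closed
    partial order matrix (closure[a][b] True when a must precede b)."""
    n = len(closure)
    np_ = [sum(closure[a][i] for a in range(n)) for i in range(n)]  # predecessors of i
    nf = [sum(closure[i][b] for b in range(n)) for i in range(n)]   # successors of i
    # Row i: NP(i) entries of -1, then n-NP(i)-NF(i) entries of 0, then NF(i) of -1
    mask = [[0 if np_[i] <= j < n - nf[i] else -1 for j in range(n)]
            for i in range(n)]
    return mask, np_, nf


# Hypothetical example: statements 0, 1, 2 form a chain; statement 3 is free.
closure = [[False, True, True, False],
           [False, False, True, False],
           [False, False, False, False],
           [False, False, False, False]]
mask, np_, nf = initial_mask(closure)
for row in mask:
    print(row)
```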

III.2.3 The algorithm

This is a stack algorithm which places statement after statement
into IRES until either all statements have been placed or an
inconsistency has been detected. Subsequent work has indicated that there may
be a slightly more efficient algorithm (see Appendix B).

1.  L = 1 (set stack depth).

2.  Restore MSK1 and tables to (L-1) state (0 state is original state).

3.  LL = IORD(L) (statements are handled in a specific order,
    discussed below).

4.  IPERM(LL) = IPERM(LL)+1 (move to next permutation at this level).

5.  IPERM(LL) .gt. JROW(LL)? (NO, go to 7).

6.  Yes, done with this level, pop stack.
        L = 1? (YES, done).
        L = L-1
        go to 2

7.  Place LL in IPERM(LL)th open position in LLth row of MSK1.

8.  Mask out row and column of MSK1 taken (JCOL, JROW = 1).

9.  Mask out all open positions before (after) taken position in
    rows of statements which must follow (precede) statement LL.
    Also adjust JROW, JCOL.

10. Is any JROW or JCOL = 0? (YES means an inconsistency, go to 2)

11. Is any JROW or JCOL = 1? (YES means we have a forced entry;
    NO, go to 12)

12. Any more to do? (NO, done; YES, push stack: L = L+1, go to 3).


There are some obvious bookkeeping details which have been omitted,
but in essence the above statement of the algorithm is correct.

We mentioned that the statements are done in a specific order.
The reason for this is that it is desirable to minimize the number of
non-hits (yeses at 10). To do this, we do statements in increasing
order of number of original open spaces (0's) in MSK1. Thus errors are
less likely to occur, since as restrictions (i.e., the number of masked-out
rows and columns) increase we are dealing with elements which had more
open spaces to start with, and so can "stand" to have some crossed out.
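
The essentials of the generator can be sketched compactly. The Python version below is a simplification offered for illustration, not a transcription of the routine. It uses recursion in place of the explicit stack, keeps a set of open positions per statement in place of the mask matrix and the JROW and JCOL counts, and omits the forced-entry shortcut of step 11. The name `linear_orders` is hypothetical.

```python
def linear_orders(closure):
    """Yield every linear order (tuple: position -> statement) consistent with
    the closed partial order matrix closure[a][b] (True when a precedes b)."""
    n = len(closure)
    np_ = [sum(closure[a][i] for a in range(n)) for i in range(n)]
    nf = [sum(closure[i][b] for b in range(n)) for i in range(n)]
    start = [set(range(np_[i], n - nf[i])) for i in range(n)]   # open positions
    # Handle statements in increasing order of original open positions.
    order = sorted(range(n), key=lambda i: len(start[i]))
    result = [None] * n

    def place(depth, open_pos):
        if depth == n:
            yield tuple(result)
            return
        s = order[depth]
        for j in sorted(open_pos[s]):
            new = [set(p) for p in open_pos]
            for t in range(n):
                if t == s:
                    continue
                new[t].discard(j)                       # column j is taken
                if closure[s][t]:                       # t must follow s
                    new[t] = {k for k in new[t] if k > j}
                elif closure[t][s]:                     # t must precede s
                    new[t] = {k for k in new[t] if k < j}
            if any(not new[t] for t in order[depth + 1:]):
                continue                                # inconsistency: back up
            result[j] = s
            yield from place(depth + 1, new)
            result[j] = None

    yield from place(0, start)


# Hypothetical example: statements 0, 1, 2 form a chain; statement 3 is free.
closure = [[False, True, True, False],
           [False, False, True, False],
           [False, False, False, False],
           [False, False, False, False]]
for perm in linear_orders(closure):
    print(perm)
```

Placing the most constrained statements first means that an inconsistency, when one exists, tends to be found near the top of the search rather than after many placements.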

IV. The cost function

The cost function has been designed to be the simplest non-trivial
I/O simulator possible. Given a permutation it computes total and
critical (i.e., spent waiting) I/O time. Effort is made to background
I/O, but the analysis is not necessarily the most sophisticated possible.

V.  The  raw  result. 

For  sets  of  statements  of  order  7-8  or  fewer,  assuming  that  the 
number  of  order  relations  is  not  impossibly  small,  it  is  feasible  to 
examine  every  linear  order  to  determine  the  one  with  minimal  cost . 

For  larger  sets  the  combinatoric  nature  of  the  problem  asserts  itself, 
and  heuristics  must  be  applied.  In  general  it  is  possible  to  apply 
heuristics  to  sets  of  order  12  or  lower,  so  larger  sets  are  chopped 
into  units  of  order  10-12. 

VI. Heuristic approaches

1. Giving up after a certain number of legal permutations have been found.
This method assumes that if an approach (i.e., some other heuristic) is
good it will generate low-cost orderings quickly, and thus if no such
orderings are found early it is safe to quit. This method is used in
conjunction with method 2 below.

2. Generating only those permutations which have all n references
to at least one variable in n+1 contiguous statements (n contiguous
statements was found to be much too restrictive, and n+2 or
greater is too lax). This approach does well, as it should. The reasoning
behind it is that any good permutation must have at least one array
resident in core for a while (to minimize critical I/O) so there will be
more time for backgrounding. The problems with this approach are first
that there is bound to be some duplication of effort between segments
dealing with different variables (the same permutation may be a hit
for more than one variable), and second that it is not always clear which
variables should be made contiguous with respect to references. The
first problem is effectively unavoidable, but does not appear to be
too serious, for when it arises it means that the permutation we are
looking at is probably good (after all, if one contiguity is good,
how bad can two be?), and we may find some better ones nearby, as
it is an observed fact that good permutations tend to cluster (be
generated close to one another).

The second problem is more difficult. Generating permutations based
on each variable eats up much of the time advantage of the method.
With all variables being examined (the test program uses 12 variables,
though it is wildly unlikely that more than 6 or 7 of them will be
arrangeable in the required contiguity, and the program recognizes
such cases quickly), this method produces slightly better results in
slightly shorter time than does the raw method (no heuristics except
to give up after finding, say, 1000 legal permutations). There do appear to be
some reasonable heuristics to determine which 4 or 5 variables are
likely to be best, but these have not been looked into too closely.

The cutoff procedure for this method is interesting, for it has
effected huge time cuts with no noticeable loss in power. Examination
of permutations based on a variable is terminated if no improvement is found
in the first 25 legal permutations, or if the only improvements in the first
hundred were in the first ten, or if 125 permutations are examined with no
improvement. In each case improvement means improvement over the
previous best permutation, where the first permutation is the original linear
order. The rationalizations for the three cutoff points are: 1) good variables
are good early; 2) variables tend to be characterized by the first few
permutations they generate; 3) no cases have been observed where sparse
permutations were especially good, and even some of the best sparse ones
have been improved on by other variables.
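
The three cutoff tests can be written as a single predicate. The sketch below is one possible reading, offered for illustration. It assumes that the third test means 125 consecutive permutations examined since the last improvement, which the text does not state explicitly. The function name and argument conventions are hypothetical.

```python
def should_stop(examined, improvements):
    """Decide whether to abandon the current variable.

    examined:     number of legal permutations examined so far for this variable
    improvements: 1-based counts at which an improvement over the previous
                  best permutation was found
    """
    # 1) no improvement in the first 25 legal permutations
    if examined >= 25 and not any(k <= 25 for k in improvements):
        return True
    # 2) the only improvements in the first hundred were in the first ten
    if examined >= 100 and all(k <= 10 for k in improvements if k <= 100):
        return True
    # 3) 125 permutations examined with no improvement (assumed: consecutive)
    last = max(improvements, default=0)
    return examined - last >= 125


print(should_stop(25, []))         # no early improvement
print(should_stop(100, [3, 9]))    # improvements only among the first ten
print(should_stop(100, [3, 40]))   # a later improvement keeps the search alive
```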

3. Generating only a very small class of permutations, but doing
a fair bit of analysis to parametrize the class. The analysis consists
of the following steps:

     1.  Create a table of all references to all variables (i.e., a
         list of statements for each variable).

     2.  Determine which pairs of statements are "good" in the sense
         of having many variables in common and also being able to be
         made contiguous.

     3.  Require as many disjoint pairs as possible to be contiguous.

There is evidence that this method is the best of all, effecting a time
cut of from 1 to 2 orders of magnitude over (2), and with possibly
better results. It is based on the theory that instead of searching
a large set of permutations, we will attempt to generate only
permutations that are close to minimal. It seems clear that the statements
paired are the very essence of a minimal permutation, and in fact may
be close to comprising sets of necessary and sufficient conditions for
a permutation to be minimal. The one drawback to this method is that
unless at least two disjoint pairs are found (and three is much better),
there is not enough reduction done to ensure that the restricted class
of permutations is small enough, for the idea of this method is to test all
permutations which are legal by the above parameters. In about half the
cases we cannot find two fairly powerful (two or more variables in common)
disjoint pairs. A possible hybrid of methods (2) and (3) would seem to
solve the problem, but this has not been tested.

VII. The algorithms were implemented on a PDP-10 computer. The
percentage reduction is from original orderings of random statements.

The table on the following page summarizes the results of these efforts.

                                  Percentage Reduction    t (Average CU Time)
Method                            (Average)               [subheadings illegible]

All permutations                  Maximum (usually        [entries illegible]
                                  about 50)

All permutations (with cutoff     close to maximum        [entries illegible]
applied for n > 12)

Method 2 (usually about 1/2       30-50                   [entries illegible],
of the variables are tested;                              per variable
analysis might reduce this to
1/4 to 1/3 for significant
saving)

Method 3 (can only meet           close to maximum        15 ≤ t ≤ 30
conditions 1/2 of the time)                               (estimated)


IV. ILLIAC OVERLAP OPTIMIZATION

The following section is concerned with execution time optimization.
Because of the unconventionality of the ILLIAC, techniques are
introduced by describing counter examples to usual optimization methods.
An attempt is made to define optimization goals and the limits of their
effectiveness. A simple optimization algorithm is introduced, and a
characterization of the optimization problem is specified.

A Conventional Approach to an Unconventional Machine

Because the number of PE's in the ILLIAC IV is limited to sixty-four,
the EXTENDED FORTRAN Compiler maps each SIM assignment
statement whose SIM variable range is greater than sixty-four into a control
loop and an assignment of sixty-four values. The control loop iterates
the assignment until it is executed for all values of the variable.
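
The mapping of one SIM assignment into a control loop can be pictured as follows. This Python sketch only emulates the shape of the generated code; it is not output of the EXTENDED FORTRAN Compiler. The function name, the particular assignment A(I) = B(I) + C(I), and the array length are assumptions made for the example.

```python
PE_COUNT = 64  # number of processing elements in the ILLIAC IV


def sim_assign(dest, src_b, src_c, n):
    """Emulate a SIM assignment A(I) = B(I) + C(I) over n values as a control
    loop whose body assigns at most 64 values at a time. Returns the number
    of passes through the control loop."""
    passes = 0
    for start in range(0, n, PE_COUNT):          # control loop (CU)
        stop = min(start + PE_COUNT, n)
        # one assignment of up to sixty-four values (PE's)
        dest[start:stop] = [b + c for b, c in
                            zip(src_b[start:stop], src_c[start:stop])]
        passes += 1
    return passes


n = 150
b = list(range(n))
c = [10] * n
a = [0] * n
print(sim_assign(a, b, c, n), a[:3], a[-1])
```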

More than one SIM assignment statement may occur within a DO SIM
loop. Since each statement within the loop is completed before
proceeding to the next, an identical control loop is generated for each
assignment statement. From a conventional optimization point of view,
the repetition of identical loops is time consuming and, in cases where
data dependency and overwrite considerations do not interfere, unnecessary.
The reduction of identical control loops to a single loop encompassing
all the assignment statements in a DO SIM loop appears to be an
effective optimization technique.

Because of overlap between CU and PE execution, the anticipated
gain in execution time is negligible. Control instructions are executed
in the CU; SIM assignments in the PE's. Since CU and PE execution
overlap, unless, in a given SIM assignment, the CU execution time is
greater than PE execution time, the elimination of CU instructions will
not change combined execution time. An examination of the timing of the
instructions we anticipate generating for SIM assignment statements has
shown that, except in the simplest cases, combining control loops is an
ineffective optimization technique.
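
The argument can be made concrete with a toy timing model. The model below is an assumption introduced for illustration: each pass through a control loop costs a fixed CU time and a fixed PE time, and with overlap the pass takes the larger of the two. The function name and all timing figures are hypothetical and are not measurements of ILLIAC instructions.

```python
def loop_time(iterations, cu_time, pe_times, merged, overlap):
    """Total time for the control loops of several SIM assignments.

    cu_time:  CU time per pass of a control loop
    pe_times: PE time per pass for each assignment statement
    merged:   True for one loop around all assignments, False for one loop each
    overlap:  True if CU and PE execution overlap (pass time is the maximum),
              False if they are strictly sequential (pass time is the sum)
    """
    combine = max if overlap else (lambda a, b: a + b)
    if merged:
        return iterations * combine(cu_time, sum(pe_times))
    return sum(iterations * combine(cu_time, p) for p in pe_times)


pe = [10, 12, 9]   # PE dominant assignments (hypothetical figures)
for overlap in (False, True):
    separate = loop_time(4, 2, pe, merged=False, overlap=overlap)
    combined = loop_time(4, 2, pe, merged=True, overlap=overlap)
    print(overlap, separate, combined)
```

Under this model, merging the loops saves time when execution is sequential, saves nothing when execution overlaps and every assignment is PE dominant, and helps again only when the CU time per pass exceeds the PE time of the individual assignments.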


Balancing by Allocation

In order to account for execution overlap between CU and PE
processing, the concept of code dominance has been developed. We
hypothesize that ILLIAC code may be broken into segments in which
either the CU or PE instructions take longer to execute. We call the
processor which takes longer to execute, and therefore determines
the execution time of that segment, the dominant resource for that
segment. In the previous example, SIM assignment statements are PE
dominant. ILLIAC optimization efforts must be directed at the dominant
resource.

If  neither  processor  is  idle  over  a  portion  of  code,  then  execution 
is  balanced.  Since,  in  the  case  of  sequential  code,  it  may  be  possible 
to  execute  instructions  in  either  the  CU  or  a  single  PE,  reduction  in 
execution  time  may  be  achieved  by  reassigning  instructions  from  the 
dominant  to  the  idle  resource.  This  procedure  will  be  referred  to  as 
balancing  by  allocation. 

A second example of the unconventionality of ILLIAC optimization:
Balancing by Relocation

A "machine independent" optimization technique which has been
examined in the literature is the removal of invariant calculation from
program loops. In the case of ILLIAC code, significant reduction in
execution time may be achieved by moving invariant calculation into
program loops. Assume that a programmer has coded a simple assignment
statement of the form A=B+C followed by a SIM assignment statement
whose SIM variable range is greater than sixty-four. The simple
assignment statement is interchangeable; that is, it may be executed
in either the CU or a single PE. The SIM assignment statement generates
a control loop and an assignment of sixty-four values. This code
is PE dominant. Clearly, execution is reduced by allocating the simple
assignment statement to the CU and moving the CU code into the PE
dominant loop. (We assume that the difference between PE and CU time
within the loop is greater than or equal to the CU time necessary to execute
A=B+C.) Moving code within a processor to a segment where the same
processor is idle will be referred to as balancing by relocation.
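
A small arithmetic model shows the size of the effect. It rests on assumptions that go beyond the text: each pass through the control loop has a fixed CU time c and PE time p, overlapped execution makes a pass take max(c, p), and the relocated statement is executed once, in the CU, during the first pass. The function names and figures are hypothetical.

```python
def time_before(a, k, c, p):
    """A=B+C (CU time a) executed ahead of a control loop of k passes."""
    return a + k * max(c, p)


def time_after(a, k, c, p):
    """A=B+C moved into the CU side of the first pass of the loop."""
    return max(c + a, p) + (k - 1) * max(c, p)


# PE dominant loop: c = 2, p = 10, so the CU is idle for 8 units per pass.
print(time_before(3, 4, 2, 10), time_after(3, 4, 2, 10))   # a fits in the idle time
print(time_before(9, 4, 2, 10), time_after(9, 4, 2, 10))   # a exceeds the idle time
```

When the CU time of the assignment fits within the idle time (p - c), its cost disappears from the total entirely; when it does not, only the excess is paid.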

*** 

In summary, execution overlap requires unconventional optimization
techniques: allocation and relocation. The first balances by allocating
instructions between processors. The second balances by relocating
instructions within a processor.

The Indeterminacy of Code Dominance

Execution is a dynamic process. If Maxwell's demon were available,
then we could identify, at each moment of execution, the dominant resource.
Because of the unavailability of such a device we would like to ascribe
the condition of dominance to ILLIAC code rather than the ILLIAC processors.
We could then allocate and relocate by means of an algorithm which
segments the code such that each segment has a distinct dominance.
Unfortunately, this situation does not obtain. Each transfer from a
dominant portion of code carries with it a 'surplus' of unexecuted instructions
which will affect the dominance of the subsequent portion of code to be
executed.

For example, suppose a program block is coded such that a large number of
instructions are interchangeable. The block's entry is a merge; one
side of the merge is balanced, the other is PE dominant. If execution
proceeds from the PE dominant branch of the merge, then the block is
optimized by making it CU dominant. If execution proceeds from the
balanced branch of the merge, then the block is optimized by balancing
it. Resource dominance is a function both of the code being executed and of
the preceding code awaiting execution. This condition is somewhat
ameliorated by the ILLIAC overlap design, which only queues PE instructions.
Consequently, an unexecuted surplus can only occur in the case of PE
dominance. Local balancing is a reasonable optimization goal in the sense
that it reduces execution time in comparison with executing all interchangeable
instructions in a single PE. From a global point of view, a knowledge of 'the
most probable path of execution' can make optimization efforts more
effective.


Conventional optimization techniques implicitly assume that
the fewer instructions executed, the more optimal the code. Techniques
which account for overlap do not obey this optimization rule. Nor
can this rule be replaced by a local balancing rule. Unfortunately, as
this example has shown, it is not sufficient to say that locally balanced
code is optimal code.

A Restriction on Balancing ILLIAC Code

A simplifying assumption, namely that transmission time between
the two processors is negligible, must be abandoned. A timing asymmetry
of significant proportions substantially affects optimization efforts. In
general, a load from a single PE to the CU takes twelve times as long
as a load from the CU to a PE. There are two ways of approaching this
asymmetry.

The first is to restrict the allocation of interchangeable instructions
such that PE to CU dependencies (i.e., an operation in the CU utilizes
an operand in a PE) do not occur. This is the approach utilized in our
allocation algorithm.

A second approach permits PE to CU dependencies, but establishes
some minimum number of contiguous CU instructions which must follow
the dependency. Our rationale is that the time necessary to load a CU
from a PE can be averaged into the overall cost of executing that portion
of code in the CU. In the following section it is assumed that
interprocessor latency has been accounted for.


Estimating  Optimization  Effectiveness 

Assuming that ILLIAC code can be balanced, it is possible to
determine the upper bound of the optimization effectiveness. That
the code can be balanced implies that interchangeable instructions
(in this case, sequential code composed of arithmetic statements
involving integer addition and subtraction) are available.

Our approach is to assume that a balanced program exists, move
the CU interchangeable instructions to a single PE, and compare execution
time (see Appendix C). In brief, allocating and relocating are equally
effective, but CU storage optimization (i.e., the utilization of local
memory for CU operands) is essential to the balancing effort. The upper
bound for a CU optimized balancing effort is a 33 per cent reduction in
execution time; if CU memory is not optimized, then the bound is
15 per cent.

An  Inexpensive  Algorithm 

The optimization algorithm we propose is inexpensive. Its
advantage is that it achieves whatever optimization is easily
attainable with minimal effort. While it is only applicable to sequential
code, we suspect the approach might be extended to encompass control
loops in SIM assignments.

The algorithm is based on the following observations about partitions
of macros generated by sequential FORTRAN code. Partition sequential
macros into subsets which have no data dependencies with respect to
each other. Members of the subsets are linearly ordered. We observe
that the execution of any two subsets may overlap. Ideally, all the
instructions in one subset would be allocated to the CU, and all the
instructions in the other subset would be allocated to a PE. In reality,
the CU instruction set is so limited that in many cases only a portion
of the macros in a subset may be executed in the CU. We therefore
make the following allocation restrictions.

A subset may be entirely allocated to the CU. A subset may be
entirely allocated to a PE. A subset may be allocated such that execution
begins in the CU and terminates in a PE, in which case the subset will
have a single CU to PE dependency.

Observe that if we allocate according to these rules, then there will
be no PE to CU dependencies. Now, for any subset, refer to those macros
allocated to the CU as the CU portion of that subset, and the macros
allocated to a PE as the PE portion of that subset. A subset may have a
CU portion, a PE portion, or both.

Observe that for any two subsets, the execution of a CU portion
and a PE portion may overlap. The objective of the allocation algorithm
is to execute the PE portion of the nth subset while executing the CU
portion of the (n+1)th subset.

We now introduce timing considerations. Although the subsets may
be executed in any order, it is desirable to avoid the following condition:
the execution of the PE portion of a subset is delayed because the execution
of the CU portion of that same subset is not complete.


We therefore introduce the following ordering restrictions. For
each portion of a subset, compute an execution time estimate for the
respective processor. Order the subsets such that for each CU to
PE dependency, the sum (over all the previous subsets) of the CU time
is less than the sum of the PE time. The resulting ordering is: subsets
with PE portions alone first; subsets with CU portions alone last;
subsets with both portions in between, ascending from maximal PE and
minimal CU to minimal PE and maximal CU. A brief description of the
algorithm follows.

The subsets correspond to FORTRAN statements which have no
data dependencies with respect to one another. Apply a method proposed
by Ramamoorthy [2] to partition sequential statements. The macros
generated by these statements correspond to the linearly ordered macros
which are members of the subsets.

Apply the following allocation rule: Assign the macros in a statement
to the CU until a PE dependent macro (e.g., a multiply) is encountered.
Assign that instruction and the remainder of that statement to a single PE.
Attach an execution time estimate to the portions of each statement.
Assign an index to each statement according to the difference between CU
and PE estimates. Order the statements according to the value of the
index, smallest values first. The resulting order minimizes PE idle time.
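
The allocation and ordering rules above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the macro
names, the set of CU-capable macros (CU_CAPABLE) and the timing table (TIME)
are hypothetical and are not taken from the ILLIAC instruction set, and the
index is taken to be the CU estimate minus the PE estimate.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical macro set and timing estimates; neither comes from the report.
CU_CAPABLE = {"load", "add", "sub", "store"}
TIME = {"load": 1, "add": 1, "sub": 1, "mul": 4, "store": 1}


@dataclass
class Statement:
    name: str
    macros: List[str]  # linearly ordered macros generated by one statement


def allocate(stmt: Statement) -> Tuple[List[str], List[str]]:
    """Assign macros to the CU until the first PE dependent macro;
    that macro and the remainder of the statement go to a single PE."""
    for i, macro in enumerate(stmt.macros):
        if macro not in CU_CAPABLE:
            return stmt.macros[:i], stmt.macros[i:]
    return stmt.macros[:], []


def estimate(macros: List[str]) -> int:
    """Execution time estimate for one portion of a statement."""
    return sum(TIME[m] for m in macros)


def order_statements(stmts: List[Statement]) -> List[Tuple[int, str]]:
    """Index each statement by (CU estimate - PE estimate), smallest first."""
    rows = []
    for s in stmts:
        cu, pe = allocate(s)
        rows.append((estimate(cu) - estimate(pe), s.name))
    rows.sort(key=lambda row: row[0])
    return rows


statements = [
    Statement("S1", ["mul", "store"]),                 # PE portion only
    Statement("S2", ["load", "add", "store"]),         # CU portion only
    Statement("S3", ["load", "add", "mul", "store"]),  # both portions
]
schedule = order_statements(statements)
```

With these assumed timings the statement with a PE portion alone is issued
first, the mixed statement next, and the statement with a CU portion alone
last, which matches the ordering described in the text.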

In order to take advantage of hardware buffering, care must be
taken when issuing code to interleave CU and PE instructions.

A  Characterization  of  a  Block  Optimizer  for  EXTENDED  FORTRAN 

The  following  characterization  is  for  a  block  optimizer.  In  the 
case  of  EXTENDED  FORTRAN  we  define  a  block  as  a  set  of  statements 
in sequential order having one entry and multiple exits. Since each
statement  in  a  DO  SIM  loop  is  executed  in  sequential  order,  a  block 
may  contain  SIM  assignment  statements,  each  with  an  identical  control 
loop.  For  our  present  purposes,  the  cyclical  nature  of  the  SIM  control 
loop  will  not  explicitly  appear  except  as  a  marking. 

A block of ILLIAC code may be characterized as a marked tree,
with nodes and edges corresponding to operations and operands
respectively. Depending on the character of the operation it represents,
a node is marked PE dependent, CU dependent, or interchangeable.

In addition, the nodes which, in the actual code, are nested in SIM
control loops are identified as iterated nodes.


Allocation may be characterized as a reduction procedure applied
to the tree. The objective of such a procedure is a tree of CU and PE
nodes. While in the previous algorithm the first PE node encountered
consigned the remainder of the statement to a PE, a more extensive
examination might reveal that a large number of interchangeable instructions
warrant returning calculations to the CU. Consequently, the first
reduction combines interchangeable nodes and assigns CU timing estimates
to them. The second reduction begins in a CU node and combines
CU and interchangeable nodes until a PE node is encountered. PE nodes
are combined until an interchangeable node is encountered. If the
CU time estimate for the interchangeable node is above some minimum
(this is a function of PE to CU latency and is unknown at this point),
then assign a new CU node and continue. Otherwise, combine the
interchangeable node with the PE node. Continue this reduction until
the tree contains no interchangeable nodes.
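
The two reductions can be illustrated on a simplified case. The sketch below
flattens the marked tree into a single linear chain of nodes, which is an
assumption made only for illustration, and it treats the unknown minimum as a
parameter (min_cu). The marks "CU", "PE" and "X" (interchangeable), the node
labels and the time values are all hypothetical.

```python
def first_reduction(nodes):
    """Combine runs of adjacent interchangeable nodes and sum their
    CU timing estimates. Each node is (label, mark, cu_time)."""
    out = []
    for label, mark, cu_time in nodes:
        if mark == "X" and out and out[-1][0] == "X":
            _, labels, total = out[-1]
            out[-1] = ("X", labels + [label], total + cu_time)
        else:
            out.append((mark, [label], cu_time))
    return out


def second_reduction(reduced, min_cu):
    """Resolve every interchangeable node to CU or PE and combine neighbours.
    After a PE node, an interchangeable node returns to the CU only if
    its CU time estimate is above min_cu; otherwise it joins the PE node."""
    out = []
    for mark, labels, cu_time in reduced:
        if mark == "X":
            after_pe = bool(out) and out[-1][0] == "PE"
            mark = "PE" if (after_pe and cu_time <= min_cu) else "CU"
        if out and out[-1][0] == mark:
            out[-1] = (mark, out[-1][1] + labels)
        else:
            out.append((mark, list(labels)))
    return out


chain = [
    ("a", "CU", 1), ("b", "X", 2), ("c", "PE", 0), ("d", "X", 1),
    ("e", "PE", 0), ("f", "X", 3), ("g", "X", 4), ("h", "PE", 0),
]
result = second_reduction(first_reduction(chain), min_cu=5)
```

In this example the short interchangeable node d is absorbed into the
surrounding PE node, while the combined run f, g exceeds the assumed minimum
and is given a new CU node.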

Relocation may be characterized as a deformation of the reduced
tree. While in the previous algorithm the ordering restrictions
were quite simple, in the present case the determination of orderings
appears to be computationally explosive. The nodes must be ordered
such that data dependencies are preserved and that timing order
correlates to logical order. In other words, for each CU-PE dependency,
the sum of CU time is less than the sum of PE time; for each PE-CU
dependency, the converse. In addition, care must be taken to keep
iterated nodes clustered. A further restriction is that only invariant
calculations may be moved into iterated clusters.

We suspect that an extension of the first algorithm, i.e. partitioning
statements before reducing the graph, will prove to be the most practical
optimization approach to ILLIAC code.

APPENDIX A: ALGORITHM FOR DETERMINING DATA DEPENDENCIES

This algorithm is based primarily on the p-graph material presented
in Shapiro and Saint's The Representation of Algorithms [5]. The final
section, on subscripted variable dependencies, extends the analysis to
include DO loops referencing singly-dimensioned arrays, which of course
potentially embody parallelism exploitable on the ILLIAC IV.

The  main  sections  of  the  algorithm  are: 

1)  Analysis  of  the  flow  logic  and  accumulation 
of  variable  use  statistics 

2)  Completion  of  p-graphs,  by  flow  block  and  by  node 

3)  Detection of inter-DO loop dependencies for
subscripted  variables. 

DETERMINATION OF FLOW BLOCKS

One scan is made over the code to determine the basic flow
blocks and, at the same time, record all variable uses within each flow
block. In order to eliminate an extra scan, it is initially assumed that
statements having labels are referenced elsewhere, that is, that they are
the start of a new flow block. Information is also recorded in terms
of "nodes" to provide the skeleton for the final p-graphs: statements are
broken down into one or more nodes according to type, and additional
nodes are assigned to the entry and exit points of each flow block.

The scan records flow data in a Flow Block Table having one entry
per flow block. Entry format:

| FB# | NODEN | NODEX | LABEL | DEF | REF |

FB#:    flow block number
NODEN:  entry node number
NODEX:  exit node number
LABEL:  label of first statement in block (if any)
DEF:    defined flag = ON if LABEL is the label of some
        statement; OFF otherwise
REF:    referenced pointer, pointing to the chain of flow
        blocks which reference (transfer to) this flow
        block. Zero if none.

A work area WA is used to store the REF chains.

Procedure:

[Flowchart: flow block scan. The connecting arrows did not survive
reproduction; the legible box contents, in reading order, are:]

    Initialize:
        CF (Control Flag) = ON
        FTF (Fall Through Flag) = ON
        CFB (Current Flow Block) = 0
        NODE = 0

    Is statement labelled? (Y/N)
    CF = ON? (Y/N)
    FTF = ON? (N: ERROR)

    Make new entry in Flow Table:
        LABEL = label (if any)
        DEF = ON
        CFB = CFB + 1
        FB# = CFB
        NODEX(CFB-1) = NODE+1
        NODEN(CFB) = NODE+2
        NODE = NODE+2

    Search Flow Table for label (Y/N)
    DEF = ON? (Y/N)
    FTF = ON? (Y/N)
    Make entry (CFB-1) in WA and chain via REF(CFB)

    Test current statement and reset flags:
        CF = ON if control statement
        FTF = ON if fall through to next statement

    CF = ON?
    For each destination label:
        Search Flow Table for label, making new entry if not there;
        chain CFB into REF of entry

    Variable Use Data
    Next statement


The flow block part of the procedure is completed by the construction of
the Flow Block Connectivity Matrix FBCM from the completed Flow Block
Table. This is a 2-dimensional Boolean matrix giving the direct connections
between flow blocks: FBCM(I,J) = 1 if flow block I transfers to flow block J.

To construct the matrix, the Flow Table entries are cycled. For each entry,
J = FB# and I = each flow block number on the REF(J) chain. FBCM(I,J) = 1
for all such I,J pairs.

The following additional errors may be found:

1)  J = 0 => statement label referenced but not defined

2)  REF = 0 => labeled statement not accessible

3)  FBCM(I,J) = 1 and there is no other non-zero element
    in row I or column J => flow block J can be
    appended to flow block I.
    This is the case of generation of an extraneous flow
    block. It probably does not pay to fix it.
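
The construction of FBCM from the REF chains, and the third check above, can
be sketched as follows. The REF chains are represented here as plain Python
lists indexed by flow block number, which is an assumption of this sketch
rather than the work-area layout used by the scan; the function names are
illustrative.

```python
def build_fbcm(ref):
    """ref[j] lists the flow blocks that transfer to flow block j.
    Returns the Boolean connectivity matrix as nested lists of 0/1."""
    n = len(ref)
    fbcm = [[0] * n for _ in range(n)]
    for j, chain in enumerate(ref):
        for i in chain:
            fbcm[i][j] = 1  # flow block i transfers to flow block j
    return fbcm


def appendable_pairs(fbcm):
    """Pairs (i, j) where FBCM(i,j) is the only non-zero element in
    row i and in column j, so block j could be appended to block i."""
    n = len(fbcm)
    pairs = []
    for i in range(n):
        for j in range(n):
            row_total = sum(fbcm[i])
            column_total = sum(row[j] for row in fbcm)
            if fbcm[i][j] and row_total == 1 and column_total == 1:
                pairs.append((i, j))
    return pairs


# Hypothetical example: 0 -> 1, 1 -> 2, 3 -> 2, 2 -> 3
ref_chains = [[], [0], [1, 3], [2]]
fbcm = build_fbcm(ref_chains)
extraneous = appendable_pairs(fbcm)
```

Here block 1 is reached only from block 0, and block 0 transfers nowhere
else, so the pair (0, 1) is reported; block 2 has two predecessors and is not.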

Artificial  entry  and  exit  blocks  are  added  to  the  set  of  flow  blocks.  The 
entry  block,  numbered  0,  precedes  all  original  entry  blocks.  The  exit 
block,  numbered  N  =  (last  flow  block  number  +  1),  succeeds  all  original 
exit  blocks. 


Variable Use Data


At the same time that a statement is passed through the flow block
scan it is also examined for variable uses. The data collected here will
furnish a record for each variable of its definitions/references in each
flow block and at each node within a block. Redundant information is
recorded, that is, by flow block as well as by node, since the next
procedure, completion of p-graphs, can be more economically executed over
flow blocks than over individual nodes.

Statements are broken down into nodes in the obvious way: assignment
statements (or the assignment part of a logical IF) become 2 nodes, the
right hand side (use = referenced) followed by the left hand side (use =
defined, except subscripts). The first (or only) node of an IF statement
represents the conditional expression (use = referenced). READ/WRITE
statements are single nodes (use = defined/referenced).

The data may be recorded as follows. Assume that a symbol table
for all variables has already been generated and that the variables at this
stage are represented by pointers into the symbol table. A new symbol
table field USEPTR will be temporarily appended. For each variable, its
USEPTR field will contain a pointer to a chain of data entries in the work
area describing its uses. There are 2 types of data chains constructed
in WA. The first, the flow block chain, is located directly by USEPTR.

The fields of a flow block entry are:

1)  FB#:    flow block number.
2)  R:      referenced flag. ON if variable is referenced in this flow block.
3)  D:      defined flag, analogous to R.
4)  DNODE:  node number of last definition of variable in the block.
5)  NPTR:   pointer to node chain for this flow block.
6)  FLINK:  pointer to next flow block entry for this variable.

The second type of chain is the node chain, which is located by the
NPTR of its flow block entry. The node entry fields are:

1)  N#:     node number
2)  R/D:    referenced or defined flag
3)  NLINK:  pointer to next node entry in the chain.


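Schematically:

[Diagram: the USEPTR field of a SYMBOL TABLE entry points to the variable's
flow block chain in WA; the NPTR field of each flow block entry points to
its node chain.]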


The process of recording the above data is simple. As each statement is
passed through the scan, it is broken down into nodes and each node
is examined for variable appearances. The USEPTR field of each variable
is accessed to locate its current flow block entry in WA. If the FB#
of this entry = CFB (current flow block), the R/D/DNODE fields are
appropriately updated. If FB# ≠ CFB or USEPTR = 0, a new flow block entry
is made and chained in via USEPTR; FB# = CFB and the other fields are set
appropriately. In both cases, a node entry is also generated.
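
A minimal sketch of this recording step is given below. It replaces the
linked chains in WA with Python lists and the USEPTR field with a dictionary
keyed by variable name; both substitutions, and the names record_use and
use_chains, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeEntry:
    node: int
    use: str  # "R" (referenced) or "D" (defined)


@dataclass
class FlowBlockEntry:
    fb: int                 # FB#
    r: bool = False         # referenced flag
    d: bool = False         # defined flag
    dnode: int = 0          # node number of last definition in the block
    nodes: List[NodeEntry] = field(default_factory=list)  # node chain


# Stands in for the USEPTR field: variable -> flow block chain, newest last.
use_chains: Dict[str, List[FlowBlockEntry]] = {}


def record_use(var: str, cfb: int, node: int, use: str) -> None:
    """Record one appearance of var at the given node of flow block cfb."""
    chain = use_chains.setdefault(var, [])
    if not chain or chain[-1].fb != cfb:
        chain.append(FlowBlockEntry(fb=cfb))  # new flow block entry
    entry = chain[-1]
    if use == "R":
        entry.r = True
    else:
        entry.d = True
        entry.dnode = node
    entry.nodes.append(NodeEntry(node, use))  # a node entry in both cases


# A = A + B in flow block 1 (nodes 3 and 4), then a use of A in flow block 2.
record_use("A", 1, 3, "R")
record_use("B", 1, 3, "R")
record_use("A", 1, 4, "D")
record_use("A", 2, 6, "R")
```

After these calls the chain for A holds two flow block entries: the first has
both flags ON with DNODE = 4, and the second has only the referenced flag ON.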


For all variables, data for the entry and exit blocks is the same:
for the entry block, D = ON at its single node, DNODE = 1; for the
exit block, R = ON at its single node = (last assigned node + 1).

COMPLETION OF THE P-GRAPH

The preceding procedures need be performed only once over the code
to be analyzed; the following procedures must be applied separately to each
variable in the code.

As stated earlier, the completion process will be performed over the
flow blocks of the program rather than the individual nodes. In general,
each flow block will be thought of as having 2 nodes, an entry and an
exit node. All entry nodes will initially be reference ("uncircled")
nodes. If a definition takes place within the block, the exit node will be
circled, referring to the last definition in the block. The result of this
procedure will be the determination of the "equivalence class" to which
a variable belongs at the entry node of every flow block. (It is then trivial
to resolve data dependencies node-wise within flow blocks.) An equivalence
class will be defined by the type and place of generation of the class.
A class generated by a single definition is designated type "D"; by a merge,
type "M". The place of generation is the flow block at the node of definition
in the first case, the merge block at its entry node in the second case.

The  procedure  described  here  is  based  on  the  p-graph  algorithm  with 
some  modifications.  Initially,  all  originally  circled  nodes  are  "propagated" 
to  their  successors,  but  thereafter  only  new  merge  nodes  are  propagated 
until  no  more  are  found.  This  raises  the  problem  of  "zeroing  out"  only  the 
subset  of  non-merge  nodes  affected  by  a  new  merge  node.  However,  since  it 
seems  desirable  anyway  to  keep  track  of  the  set  of  equivalence  classes  related 
to  each  merge  node,  this  can  be  done  in  a  way  that  will  record  the  equivalence 
class associated with every entry to a merge. Thus a new class always
overlays the old class in its assigned slot without destroying any information.

ECM - Equivalence Class Matrix. The element ECM(I,J) (if non-null) is the
equivalence class passed from flow block I to flow block J. An entry is
"type/FB#", where type = D or M and FB# is the number of the flow block
in which generation of the class occurs.

USE Table - one entry per flow block. Entry format:

| FB# | R | D | DNODE | M | NEC/MNODE |

This table describes the variable's use and the equivalence class on
entry to each flow block. The FB#, R, D and DNODE fields are taken directly
from the flow block chain data generated in the first procedure. The M
flag is turned ON when a flow block is found to be a merge point of
more than 1 equivalence class. NEC/MNODE describes the entry equivalence
class for this flow block unless M is ON, when it designates
the entry node (MNODE) instead.

MLIST - Merge List. The numbers of merge blocks are placed on this
list as they are discovered during the procedure.

SSTACK - Successor Stack. Push down stack used to store the numbers of
successor flow blocks of a node during the propagation process. Entry
format: PFB, SFB (see below).

CEC - Current Equivalence Class. Same format as an ECM entry.

CFB - Current Flow Block.

PFB - Propagating Flow Block.

SFB - Successor Flow Block.



[Flowchart: propagation of equivalence classes. The connecting arrows did
not survive reproduction; the legible box contents, in reading order, are:]

    For each flow block CFB, starting with CFB = 0:
        D(CFB) = ON?  If yes, propagate CEC = D/CFB.
        Next flow block, until last.

    Take each entry from MLIST as CFB, until no more:
        D(CFB) = ON?  If no, propagate CEC = M/CFB.

    Propagate:
        PFB = CFB
        Stack successors of PFB (FBCM row #PFB) on SSTACK
        Pop SSTACK, giving PFB, SFB; if no more, RETURN
        ECM(PFB,SFB) = CEC?
        Set ECM(PFB,SFB) = CEC
        M(SFB) = ON?
        Scan ECM column #SFB for any entry ≠ CEC
        Set M(SFB) = ON; add SFB to MLIST
        D(SFB) = ON?

    *    Stop propagation at completion of cycle
    **   Stop propagation at circled node
    ***  New merge


The procedure is completed by the filling in of the NEC/MNODE fields
of the USE Table. If M = OFF for the entry for flow block J, NEC
is obtained by scanning column J for any non-null entry (all are the same).
If M = ON, MNODE is set to the entry node of the flow block (NODEN
field of the Flow Block Table). Note that the set of entering classes in
the latter case is obtainable from the ECM column corresponding to the
merge block.

The completed USE Table, together with the FBCM giving the connections
between flow blocks, defines the flow block p-graph.
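
The propagation process can be sketched as below for a single variable. This
is one reading of the partly illegible flowchart, not a verified transcription
of it: the order of the tests inside the loop is an assumption, ECM is held as
a dictionary keyed by (PFB, SFB) instead of a matrix, and the names
propagate_classes and entry_class are illustrative.

```python
def propagate_classes(succ, defined):
    """succ[b] lists the successor flow blocks of b; defined is the set of
    flow blocks in which the variable is defined (D = ON).
    Returns (ecm, merge) where ecm maps (pfb, sfb) to a class such as
    ("D", 0) or ("M", 3), and merge is the set of blocks with M = ON."""
    ecm = {}
    merge = set()
    mlist = []

    def propagate(cec, cfb):
        sstack = [(cfb, s) for s in succ[cfb]]
        while sstack:
            pfb, sfb = sstack.pop()
            if ecm.get((pfb, sfb)) == cec:
                continue  # class already passed on this edge: cycle completed
            ecm[(pfb, sfb)] = cec  # a new class overlays the old one
            if sfb in merge:
                continue  # a merge block generates its own class
            differing = [c for (_, s), c in ecm.items() if s == sfb and c != cec]
            if differing:
                merge.add(sfb)  # new merge
                mlist.append(sfb)
                continue
            if sfb in defined:
                continue  # circled node: the definition stops the class
            sstack.extend((sfb, s) for s in succ[sfb])

    for block in sorted(succ):
        if block in defined:
            propagate(("D", block), block)
    while mlist:
        block = mlist.pop(0)
        if block not in defined:
            propagate(("M", block), block)
    return ecm, merge


def entry_class(ecm, merge, j):
    """NEC for flow block j: the merge class if M = ON, otherwise any
    non-null entry of column j (all such entries are the same)."""
    if j in merge:
        return ("M", j)
    column = [c for (_, s), c in ecm.items() if s == j]
    return column[0] if column else None


# Hypothetical diamond: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 3 -> 4.
# The variable is defined in the entry block 0 and again in block 1.
successors = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
ecm, merge_blocks = propagate_classes(successors, defined={0, 1})
```

In the example, classes D/0 and D/1 meet at block 3, which therefore becomes a
merge block, and the class M/3 is what reaches block 4.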

The next procedure passes over the nodes of the flow blocks to make the
final assignment of equivalence classes. The information is obtained
from the USE Table and the node chains generated in the Variable Use
Data section. Three fields will be added to each node entry to record
the equivalence class:

FDEF = flow block number of the class
NDEF = node number of the class
TDEF = type of class: D or M




Loop flow block chain by USEPTR (variable):

    If M(FB#) = OFF:
        TDEF, FDEF = NEC(FB#)
        If TDEF = D, NDEF = DNODE(FDEF)
        If TDEF = M, NDEF = MNODE(FDEF)

    If M(FB#) = ON:
        TDEF = M
        FDEF = FB#
        NDEF = MNODE(FB#)

    Loop node chain by NPTR(FB#):
        If R(N#) = ON, enter current TDEF, FDEF, NDEF in node entry
        If D(N#) = ON, reset FDEF = FB#
                             NDEF = N#
                             TDEF = D
    Next node entry (until end of chain)

Next flow block entry (until end of chain)

Exit




SUBSCRIPTED VARIABLES

The previous analysis does not distinguish between different elements of the
same array in determining data dependencies. Work in this area has been
done only within the limited context of the ILLIAC problem, that is, the
analysis of DO index subscripted variables in DO loop code for the
purpose of finding inter-loop dependencies. The previous procedures can
be rather simply modified to find dependencies in this special case:


Variable Use Data

A "subscript" chain is interposed at a level higher than the flow block
chain. A subscript entry is made for each unique subscript form associated
with an array variable.


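[Diagram: SYMBOL TABLE entry for an array variable. USEPTR points to a chain
of subscript entries, each of which leads in turn to a flow block chain.
The field labels are not legible in the source copy.]
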
This procedure, i.e., the propagation process, is performed separately
for each subscript chain of the array variable. In this case, multiple
ECM and USE tables must be kept, a set for each subscript form.

The subscript forms are then examined and are assigned indices 1, 2, ..., n
according to decreasing order of the subscripts. An inter-loop dependency
occurs when there is a use of an initial value (equivalence class D/0) on the
p-graph corresponding to index n' > 1 and there is another graph of index
n" < n' showing a non-initial value at the exit block. That is, in this case
the value for the D/0 use was actually generated in a previous iteration of
the loop.

The following procedure updates the data tables to reflect inter-loop
dependencies. It is necessary now to append a subscript index field
to ECM entries since there will be "cross p-graph" references. (In the
initial completion, all indices = the index of the graph.) The NEC field
of USE entries must be similarly expanded.

(1)  I = 1

(2)  The NEC/MNODE fields of the USE Table for index I are completed.
     The value (entry equivalence class) for the exit block will re-define
     all D/0 equivalence classes for index (I+1). If this value is itself
     D/0, skip to step (4).

(3)  The ECM corresponding to index (I+1) is scanned and all appearances
     of D/0 are replaced by the above value, setting subscript index =
     index of this value. In particular, values in the column corresponding
     to the exit block are now updated.

(4)  I = I+1. If I < n, go back to (2).


APPENDIX B: OBSERVATIONS ON PARTIAL ORDERING


Problem: Given a finite set $A$ and an irreflexive partial order
$r \subseteq A \times A$, how many linear orders on $A$ are consistent with $r$?

Def 1: $r \subseteq A \times A$ is an irreflexive partial order if

1)  $\forall a \in A,\ (a,a) \notin r$

2)  $(a,b), (b,c) \in r \Rightarrow (a,c) \in r$

Note: $(a,b) \in r$ is also written $a <_r b$, or $a < b$ if $r$ is understood.

Lemma 1: If $r_1, r_2$ are (irreflexive) partial orders then $r_1 \cap r_2$ is
a partial order.

Proof: 1)  $\forall a \in A,\ (a,a) \notin r_1 \cap r_2$

2)  $(a,b) \in r_1 \cap r_2 \Rightarrow (a,b) \in r_1, r_2$;
$(b,c) \in r_1 \cap r_2 \Rightarrow (b,c) \in r_1, r_2$;
thus $(a,c) \in r_1$ and $(a,c) \in r_2$, so $(a,c) \in r_1 \cap r_2$.

Def 2: $L \subseteq A \times A$ is a linear order if it is an irreflexive partial
order and $\forall a, b \in A,\ a \neq b \Rightarrow (b,a) \in L$ or $(a,b) \in L$.
Linear orders are also called chains.

If $\{a_1, \ldots, a_n\} \subseteq A$ is linearly ordered by
$a_1 < a_2 < \cdots < a_n$, then we denote this by
$\langle a_1, a_2, \ldots, a_n \rangle \subseteq r$
[i.e. $\langle a_1, \ldots, a_n \rangle = \{(a_i, a_j) \mid 1 \le i < j \le n\}$].

Def 3: The restriction of $r$ to $B \subseteq A$, written $r|B$, is the set
$\{(a,b) \in r \mid a, b \in B\}$.

Def 4: The primitive of a partial order $r$, written $p(r)$, is the set
$\{(a,b) \in r \mid \nexists\, c \in A \text{ such that } (a,c), (c,b) \in r\}$.

Note: In Graph Theory the primitive is sometimes called the
Hasse diagram.

Def 5: The transitive closure of a subset $Q \subseteq A \times A$ is defined by

$C(Q) = \bigcap \{Q' \supseteq Q \mid Q' \subseteq A \times A \text{ is a partial order}\}$

It follows immediately from Lemma 1 that $C(Q)$ is either a partial
order or all of $A \times A$.

Def 6: [The statement of this definition, which concerns the extract of a
set of pairs $\{(a_1,b_1), \ldots, (a_n,b_n)\}$, is not legible in the source copy.]

Lemma 2: If $r$ is a partial order then $C(p(r)) = r$. Also, if
$X_1 \supseteq X_2$ then $C(X_1) \supseteq C(X_2)$.

Proof: Let $X_1 \supseteq X_2$. If $Q$ is a partial order and $Q \supseteq X_1$,
then $Q \supseteq X_2$. Thus $X_2 \subseteq C(X_1)$; but by Lemma 1, $C(X_1)$ is
a partial order (unless it is $A \times A$, in which case
$C(X_1) \supseteq C(X_2)$ for any $X_2 \subseteq A \times A$). Thus, since
$C(X_1)$ is a partial order containing $X_2$, the intersection of all
partial orders containing $X_2$ is contained in $C(X_1)$. Thus
$C(X_2) \subseteq C(X_1)$.

Now consider $r$ and its primitive $p(r)$. Since $p(r) \subseteq r$,
$C(p(r)) \subseteq C(r) = r$. Assume the inclusion is proper. Thus
$\exists (a,b) \in r$ such that $(a,b) \notin C(p(r))$. Now $\exists c$ such that
$(a,c), (c,b) \in r$, for otherwise $(a,b) \in p(r) \subseteq C(p(r))$. Also,
not both $(a,c)$ and $(c,b)$ may be in $p(r)$, for otherwise any partial
order containing $p(r)$ would contain $(a,b)$ by transitivity and $(a,b)$
would be in $C(p(r))$. Assume without loss of generality that
$(a,c) \notin p(r)$. Thus $\exists c_1$ such that $(a,c_1), (c_1,c) \in r$.
Further, $c_1 \notin \{a,b,c\}$, for by transitivity $(c_1,b) \in r$ and we know
that $(c_1,c)$ and $(a,c_1) \in r$. At the nth step we have
$c_n \notin \{a, b, c, c_1, c_2, \ldots, c_{n-1}\}$.
Thus we will create an infinite sequence of distinct points of $A$.
But $A$ is finite. Hence the inclusion cannot be proper, and $C(p(r)) = r$.
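
The identity C(p(r)) = r can be checked on a small example. The sketch below
represents a relation as a Python set of pairs and computes the ordinary
transitive closure, which coincides with C whenever the result is irreflexive;
that equivalence, and the function names closure and primitive, are
assumptions of this illustration.

```python
def closure(q):
    """Smallest transitive relation containing the set of pairs q."""
    c = set(q)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(c):
            for (b2, d) in list(c):
                if b == b2 and (a, d) not in c:
                    c.add((a, d))
                    changed = True
    return c


def primitive(r):
    """Pairs (a, b) of r with no c such that (a, c) and (c, b) are in r."""
    elements = {x for pair in r for x in pair}
    return {
        (a, b)
        for (a, b) in r
        if not any((a, c) in r and (c, b) in r for c in elements)
    }


# Hypothetical partial order on {a, b, c, d}: a < b < c and a < d < c.
r = closure({("a", "b"), ("b", "c"), ("a", "d"), ("d", "c")})
hasse = primitive(r)
```

For this order the primitive drops only the pair (a, c), and closing the
primitive again restores r.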

Lemma 3: For any partial order $r$ there is a linear order $L$ such that
$L \supseteq r$. Also, if $L$ is a linear order on $A$ and $o(A) = n$, then
$o(L) = n(n-1)/2$.

Proof: Let $r$ be a partial order on $A$. If $r$ is linear we are done, so
assume it is not. Thus $\exists a, b \in A$ such that $(a,b)$ and $(b,a) \notin r$.
Let $r_1 = C(r \cup \{(a,b)\})$. $r_1$ is a partial order properly containing $r$.
Assume $r_1$ is not linear. Thus $\exists a_1, b_1 \in A$ such that $(a_1,b_1)$ and
$(b_1,a_1) \notin r_1$. If no $r_n$ were linear, proceeding in this manner would
construct an infinite ascending sequence of subsets of $A \times A$, which is
impossible since $A \times A$ is finite. Thus for some $n$, $r_n$ is a linear order.

Now let $o(A) = n$ and let $L$ be a linear order on $A$. Consider $x \in A$.
$\forall y \in A$ such that $y \neq x$, $(x,y) \in L$ or $(y,x) \in L$ but not both.
Thus each $x$ appears in $n-1$ elements of $L$. Thus $o(L) = n(n-1)/j$, where
$j$ is the number of times we have counted each pair; but $j = 2$ because
$(x,y)$ is counted exactly twice: once as a pair containing $x$ and once as
a pair containing $y$.
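
For small sets the problem posed at the start of this appendix can be answered
by direct enumeration, which also confirms the count o(L) = n(n-1)/2. The
brute-force sketch below tries every permutation, so it is an illustration
only and not a proposed method; the function name linear_orders and the
example relation are assumptions.

```python
from itertools import permutations


def linear_orders(elements, r):
    """Yield every linear order on elements, as a set of pairs,
    that is consistent with (i.e. contains) the relation r."""
    for perm in permutations(elements):
        position = {x: i for i, x in enumerate(perm)}
        if all(position[a] < position[b] for a, b in r):
            yield {(x, y) for x in perm for y in perm if position[x] < position[y]}


# Hypothetical partial order generated by a < b < c and a < d < c.
elements = ["a", "b", "c", "d"]
relation = {("a", "b"), ("b", "c"), ("a", "d"), ("d", "c")}
orders = list(linear_orders(elements, relation))
n = len(elements)
```

Only b and d are incomparable in this example, so exactly two linear orders
are consistent with the relation, and each contains n(n-1)/2 = 6 pairs.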


Def 7: The set of predecessors of an element $a$ when $A$ is ordered by $r$, is

$P_r(a) = \{b \in A \mid (b,a) \in p(r)\}$

The set of followers of $a$, $A$ ordered by $r$, is

$F_r(a) = \{b \in A \mid (a,b) \in p(r)\}$

The predecessors (followers) of a subset $X$ of $A$ is the set of

maximal (minimal) elements of $\{y \in A \mid x \in X \Rightarrow y < x\}$ ($\{y \in A \mid x \in X \Rightarrow y > x\}$)

Def 8: A merge point of a partial order $r$ is a point $a$ such that:

1) $a \in p(r)'$

2) $o(P_r(a)) \ge 2$ or $o(F_r(a)) \ge 2$ (or both)

If $o(P_r(a)) \ge 2$, $a$ is a left merge point. Similarly, if $o(F_r(a)) \ge 2$, $a$
is a right merge point.
Let $\mu(r)$ be the set of merge points of $r$.
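
The definitions of predecessors, followers and merge points can be made concrete with a short Python sketch. This is an illustration only: it assumes that $p(r)$ denotes the pairs of $r$ with no element strictly between them (as the proof of Lemma 2 suggests) and that $r$ is given as a transitively closed set of pairs. All function names are hypothetical.

```python
def primitive_pairs(r):
    """Pairs (a, b) of r with no c such that (a, c) and (c, b) are both in r."""
    elems = {x for pair in r for x in pair}
    return {(a, b) for (a, b) in r
            if not any((a, c) in r and (c, b) in r for c in elems)}


def predecessors(r, a):
    """Immediate predecessors of a, read off the primitive pairs."""
    return {b for (b, x) in primitive_pairs(r) if x == a}


def followers(r, a):
    """Immediate followers of a, read off the primitive pairs."""
    return {b for (x, b) in primitive_pairs(r) if x == a}


def merge_points(r):
    """Return (left merge points, right merge points) of r."""
    elems = {x for pair in r for x in pair}
    left = {a for a in elems if len(predecessors(r, a)) >= 2}
    right = {a for a in elems if len(followers(r, a)) >= 2}
    return left, right
```

In a diamond-shaped order (s below a and b, both below t), t comes out as a left merge point and s as a right merge point.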
Def 9: A chain $\langle a_1, \ldots, a_n \rangle \subseteq r$ is a base chain of $r$ iff:

1) no $a_i$ is a merge point of $r$ for $1 \le i \le n$

2) $(b,a_1) \in p(r) \Rightarrow b \in \mu(r)$

3) $(a_n,c) \in p(r) \Rightarrow c \in \mu(r)$

Def 10: A pair of merge points of $r$, $(p_1,p_2)$, is ready if there are base chains
$\langle a_1, \ldots, a_n \rangle$ and $\langle b_1, \ldots, b_m \rangle$ of $r$ such that
$\{(p_1,a_1), (p_1,b_1), (a_n,p_2), (b_m,p_2)\} \subseteq p(r)$
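
A minimal Python sketch of Def 9, given only as an illustration: it collects the base chains of an order from its primitive pairs by walking maximal runs of non-merge points. The name `base_chains` and the representation of $p(r)$ as a set of pairs are assumptions made for this example.

```python
def base_chains(prim):
    """Return the base chains of the order whose primitive pairs are `prim`."""
    elems = {x for pair in prim for x in pair}
    preds = {x: {a for (a, b) in prim if b == x} for x in elems}
    fols = {x: {b for (a, b) in prim if a == x} for x in elems}
    merge = {x for x in elems if len(preds[x]) >= 2 or len(fols[x]) >= 2}
    chains = []
    for x in sorted(elems - merge):
        # x starts a chain when it has no predecessor or its predecessor is a merge point
        if preds[x] <= merge:
            chain = [x]
            while True:
                # a non-merge point has at most one follower
                nxt = fols[chain[-1]] - merge
                if not nxt:
                    break
                chain.append(next(iter(nxt)))
            chains.append(chain)
    return chains
```

With two chains running from a merge point s to a merge point t, the pair (s, t) satisfies the condition of Def 10.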

Lemma 4: If $\langle a_1, \ldots, a_n \rangle$ and $\langle b_1, \ldots, b_m \rangle$ are base chains of $r$ they are equal or
disjoint.

Proof: Suppose $a_i \in \langle a_1, \ldots, a_n \rangle$. We will show that
$\langle a_1, \ldots, a_n \rangle$ is the only base chain containing $a_i$. This will clearly be
sufficient to prove the Lemma. Thus let $\langle b_1, \ldots, b_m \rangle$ be another base
chain containing $a_i$ and let $a_i = b_j$.

If $i = 1$ then $a_1 = b_j$, so $P_r(a_1) \subseteq \mu(r)$. Thus $P_r(b_j) \subseteq \mu(r)$ and $j = 1$ since
no member of a base chain is a merge point. If $i \neq 1$ then $P_r(a_i) = \{a_{i-1}\}$
and $P_r(b_j) = \{b_{j-1}\}$, so $a_{i-1} = b_{j-1}$, and proceeding by induction $a_1 = b_1$.

We can similarly show that for any $k$, $a_k = b_k$. Thus we are done unless
$m \neq n$. Assume $m > n$ without loss of generality. Since $F_r(a_n) = \emptyset$ or
$\{$a merge point$\}$ then $F_r(b_n) = \emptyset$ or $\{$a merge point$\}$. Thus
$b_{n+1} \notin \langle b_1, \ldots, b_m \rangle$, so $m = n$.
Def 11: The completion of $r$ on $A$ is the partial order on the set
$A \cup \{\lambda, \rho\} = A^*$ defined by
$r^* = C(p(r) \cup X \cup Y)$, where
$X = \{(\lambda, a) \mid P_r(a) = \emptyset\}$

$Y = \{(a, \rho) \mid F_r(a) = \emptyset\}$
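
The completion can be computed directly from the definition. The Python sketch below is an illustration under stated assumptions: $r$ is a transitively closed set of pairs on $A$, $p(r)$ is taken to be the pairs with no intermediate element, and the added least and greatest elements are represented by the placeholder strings `"lambda"` and `"rho"`. The names `transitive_closure` and `completion` are hypothetical.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs` (the operator C)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def completion(A, r, bottom="lambda", top="rho"):
    """Return r* = C(p(r) | X | Y) on A plus the two added end points."""
    prim = {(a, b) for (a, b) in r
            if not any((a, c) in r and (c, b) in r for c in A)}
    # X links the new bottom to every element with no predecessor
    X = {(bottom, a) for a in A if not any(b == a for (_, b) in prim)}
    # Y links every element with no follower to the new top
    Y = {(a, top) for a in A if not any(x == a for (x, _) in prim)}
    return transitive_closure(prim | X | Y)
```

In the result every element of $A$ lies strictly between the two added end points, including elements that do not occur in any pair of $r$.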

Def 12: A lattice is a partial order such that the operations inf and sup
on all pairs of elements are well defined.

Def 13: A path from $a$ to $b$ (both in $A$) is an $n$-tuple $(c_1, \ldots, c_n)$ such that

1) $c_1 = a$; $c_n = b$

2) $\forall\, i \le n-1$, $(c_i, c_{i+1}) \in p(r)$
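
As an illustration of Def 13, the following Python sketch enumerates all paths between two points by following primitive pairs. The function name `paths` and the representation of $p(r)$ as a set of pairs are assumptions made for this example; the recursion terminates because a partial order has no cycles.

```python
def paths(prim, a, b):
    """All tuples (c_1, ..., c_n) with c_1 = a, c_n = b and each
    consecutive pair (c_i, c_{i+1}) in the primitive pairs `prim`."""
    if a == b:
        return [(a,)]
    found = []
    for (x, y) in prim:
        if x == a:
            # extend every path from the follower y to b with the step a -> y
            found.extend((a,) + rest for rest in paths(prim, y, b))
    return found
```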

Def 14: $X \subseteq A$ is a cluster if

1) $o(X) \ge 2$

2) $x, y \in X \Rightarrow P_r(x) = P_r(y)$ and $F_r(x) = F_r(y)$
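
Clusters in the sense of Def 14 can be found by grouping elements that share the same predecessor set and the same follower set. The Python sketch below is illustrative only; `clusters` is a hypothetical name, and it returns the maximal clusters (any subset with at least two elements of a returned group also satisfies the definition).

```python
from collections import defaultdict


def clusters(elements, prim):
    """Group elements with identical predecessor and follower sets.

    `prim` is the set of primitive pairs p(r); only groups with at least
    two members are returned.
    """
    groups = defaultdict(set)
    for x in elements:
        preds = frozenset(a for (a, b) in prim if b == x)
        fols = frozenset(b for (a, b) in prim if a == x)
        groups[(preds, fols)].add(x)
    return [g for g in groups.values() if len(g) >= 2]
```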

Def 15: $T \subseteq r$ is a tangle if

1) There is a least element $\alpha$ and a greatest element $\beta$ in $T$

2) There exist $b_1, b_2, \ldots, b_5 \in T$ such that

a) $b_1 < b_2, b_3, b_4, b_5$

b) $b_2, b_3 < b_4$

c) $b_2 < b_5$

d) no relation in $r$ holds between $b_2, b_3$ or $b_4, b_5$

e) if $\sup \{b_2, b_3\}$ and $\inf \{b_4, b_5\}$ exist, $\sup \neq \inf$
Theorem 1: If $r$ is a partial order on $A$ then $r^*$ is a lattice or $r^*$ contains a
tangle.

Proof: Suppose $r^*$ is not a lattice. Then since $a \in A \Rightarrow \lambda < a < \rho$ the only
way $r^*$ can fail to be a lattice is if there exist $a, b \in A$ such that
$\{c \in A \mid c \ge a, c \ge b\}$ contains at least two minimal elements, or if a similar
statement holds about $\{c \in A \mid c \le a, c \le b\}$.

Assume without loss of generality that $c_1$ and $c_2$ are distinct
minimal elements of $\{c \in A \mid c \ge a, c \ge b\}$. I now claim that there
is a tangle in $r^*$.

Proof of Claim: Let $\alpha = \lambda$, $\beta = \rho$, $b_1 = \alpha$, $b_2 = a$, $b_3 = b$, $b_4 = c_1$, $b_5 = c_2$.

Then we have

a) $b_1 < b_2, b_3, b_4, b_5$ trivially.

b) $b_2, b_3 < b_4$ since $b_4 \in \{c \in A \mid c \ge b, c \ge a\}$ and the
intersection of this set with $\{a, b\}$ is $\emptyset$, because
if $a$ (say) is in the set then it would be the only
minimal element and we have assumed there
are at least two.

c) $b_2 < b_5$ for the same reason.

d) No relation holds between $b_2$ and $b_3$ because
otherwise the greater one would be in $\{c \in A \mid c \ge b, c \ge a\}$.
No relation holds between $b_4$ and $b_5$
since they are distinct minimal elements of a
set.

e) $\sup \{b_2, b_3\}$ does not exist by assumption.

Thus if $r^*$ is not a lattice it contains a tangle.
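
The failure mode used in this proof (a pair of elements whose upper bounds have two distinct minimal elements) can be tested mechanically. The Python sketch below is an illustration, assuming the order is given as a transitively closed set of strict pairs; `is_lattice` is a hypothetical name.

```python
from itertools import combinations


def is_lattice(elements, r):
    """True if every pair has exactly one minimal upper bound and exactly
    one maximal lower bound under the strict order r."""
    def leq(a, b):
        return a == b or (a, b) in r

    for a, b in combinations(elements, 2):
        upper = [c for c in elements if leq(a, c) and leq(b, c)]
        lower = [c for c in elements if leq(c, a) and leq(c, b)]
        # minimal upper bounds: no other upper bound lies strictly below
        minimal = [c for c in upper if not any((d, c) in r for d in upper)]
        # maximal lower bounds: no other lower bound lies strictly above
        maximal = [c for c in lower if not any((c, d) in r for d in lower)]
        if len(minimal) != 1 or len(maximal) != 1:
            return False
    return True
```

A diamond passes the test, while an order in which two incomparable points both lie below two incomparable points fails it, matching the configuration from which the tangle is built above.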

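Def 16: $X \subseteq Y$ is a segment iff:

1) $Y$ is a lattice.

2) There exist $\alpha, \beta \in X'$ such that

a) $x \in X' \Rightarrow \alpha \le x \le \beta$

b) $\alpha \le x \le \beta \Rightarrow x \in X'$ or $x$ is a member of
a base chain of $Y$ connecting $\alpha$ and $\beta$.

Def 17: A segment $X$ is closed under $r$ if $\forall\, x \in X'$,
$(y,x) \in p(r)$ or $(x,y) \in p(r) \Rightarrow x = \sup X'$ or $x = \inf X'$ or $y \in X'$.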
Theorem 2: If $X$ is a closed segment of $Y$ and $X$ contains a tangle, cluster,
or ready pair of merge points then $Y$ contains the same tangle,
cluster or ready pair.

Proof: 1) tangle: Since $X \subseteq Y$, the $X$-tangle would fail to be a $Y$-tangle
if $b_2$ and $b_3$ or $b_4$ and $b_5$ were comparable in $Y$. But for this to be
true (say for example $b_2 <_Y b_3$) there would have to be a $y \in Y' - X'$ such that $b_2 < y < b_3$.
But since neither $b_2$ nor $b_3$ can be $\inf X'$ or $\sup X'$ this is impossible.

The only other possibility is that $\inf_Y \{b_4, b_5\} > \inf_X \{b_4, b_5\}$ or
$\sup_Y \{b_2, b_3\} < \sup_X \{b_2, b_3\}$, and this would violate closure.

2) cluster: an $X$-cluster $C$ could fail to be a $Y$-cluster only if:

a) there exist $y \in Y' - X'$, $c_1, c_2 \in C$ such that $c_1 < y < c_2$, or

b) there exist $y \in Y' - X'$, $c_1, c_2 \in C$ such that $y < c_1$ and $y \not< c_2$, or $y > c_1$ and
$y \not> c_2$.

But either of these conditions would contradict the closedness of $X$.

3) ready pair: If $(x_1, x_2)$ is $X$-ready then the only way it can
fail to be $Y$-ready is if one of the chains connecting $x_1$ and $x_2$
is not a base chain of $Y$. But this can only happen if there exist $y \in Y' - X'$ and
$b_i \in \langle b_1, \ldots, b_n \rangle$ connecting $x_1$ and $x_2$ such that $(y,b_i) \in p(Y)$ or
$(b_i,y) \in p(Y)$. But this contradicts the closedness of $X$.

Theorem 3: If a lattice contains a non-trivial (i.e. one that does not
consist of a single chain) segment which contains a tangle,
cluster or ready pair then the lattice also contains a tangle,
cluster or ready pair (not necessarily the same one).

Proof: Let $r$ be a lattice, $X$ a segment of $r$. Consider
$Z = \{z \in X' \mid \inf_{r|X'} \mu(r|X) \le z \le \sup_{r|X'} \mu(r|X)\}$.

$r|Z$ is a segment of $r$ and any tangle, cluster or ready pair in $X$
must also be in $r|Z$. If $r|Z$ is closed under $r$ then by Theorem 2
we are done. Therefore, assume $r|Z$ is not closed under $r$, and
without loss of generality assume there exist $y \in r' - Z$, $z \in Z$ such that $(z,y) \in r$.

Then let $b_1 = \inf Z$, $b_2 = z$, $b_3$ = any element of $Z$ not comparable or
equal to $z$, $b_4 = \sup Z$, $b_5 = y$. This is clearly an $r$-tangle.

The only problem which could arise is if $z$ above is comparable to
every other point of $Z$. If not all elements of $Z$ which connect to the
outside (i.e. violate closure) have this property then we simply
choose one which does not have it and we are done. Thus assume
$z_0 \in Z$, $y \in r' - Z$ with $(y,z_0) \in p(r)$ or $(z_0,y) \in p(r) \Rightarrow \forall\, z \in Z$,
$(z_0,z) \in r|Z$ or $(z,z_0) \in r|Z$.

Let the elements of $Z$ satisfying this condition be $z_1, z_2, \ldots, z_k$
and assume $z_1 < z_2 < \ldots < z_k$.

We know that if any $(z_i, z_j)$ is ready in $r|Z$ then by Theorem 2 we are
done, so at most one base chain of $r|Z$ connects any two elements of $Z$.
Now let $Z^0 = \{z \in Z \mid z \le z_1\}$; $Z^i = \{z \in Z \mid z_i \le z \le z_{i+1}\}$, $1 \le i \le k-1$;
$Z^k = \{z \in Z \mid z_k \le z\}$.

It is clear that the $r|Z^i$ are segments of $r$, and I claim now
that they are closed under $r$. For assume there exist $y \in r' - Z^i$, $z_0 \in Z^i$ such that
$(y,z_0) \in p(r)$ or $(z_0,y) \in p(r)$ but $y$ is not comparable with either
$\inf Z^i$ or $\sup Z^i$. Then, since $z_0$ cannot be a point violating
closure of $Z$, $y \in Z$. But $\forall\, z \in Z$, $\sup Z^i$ and $\inf Z^i$ are
comparable to $z$. Hence the $Z^i$ are closed under $r$.

I now claim further that any tangle, cluster or ready pair of $Z$
must lie wholly within one of the $Z^i$.

a) ready pair: Assume $\{a,b\} \subseteq Z$ is ready and $a < b$, $a \in Z^i$. If
$b \notin Z^i$ then $a$ is an end-point of $Z^i$ since the $Z^i$ are closed. But now let
$b$ be in $Z^j$, so $b$ must be an endpoint of $Z^j$. But by construction of
the $Z^i$ no pair of endpoints can be ready unless they are endpoints
of the same $Z^i$.

b) tangle: In this case we will modify the claim to assert that if
there is a tangle in $Z$ then one must lie wholly within a $Z^i$. To show
this it is sufficient to show that in a $Z$-tangle $b_2$, $b_3$, $b_4$ and $b_5$
are in the same $Z^i$, for none of them may be an endpoint, so the
endpoints of the $Z^i$ will do for $\alpha$ and $\beta$.

Thus assume there is a $Z$-tangle. Clearly $b_2$ and $b_4$ are in the same
$Z^i$ as $b_3$ and $b_5$ respectively, for any element of one $Z^i$ is comparable
to any element of another. Also, since $\sup \{b_2, b_3\} \neq \inf \{b_4, b_5\}$,
$b_5$ is in the same $Z^i$ as the sup. But this is the same
as the one $b_2$ and $b_3$ are in, so $b_2$, $b_3$, $b_4$ and $b_5$ are all in the
same $Z^i$.

c) cluster: Since no two elements of a cluster are comparable
they all must be in the same $Z^i$.

Now since the $Z^i$ are closed under $r$, any tangle, cluster or
ready pair in a $Z^i$ is one in $r$. But any one in $X$ is one in a $Z^i$.
Hence any one in $X$ is one in $r$ and we are done.


Procedure: 

The purpose of this procedure is to construct an ascending sequence of
partial orders on $A$, $(W_1, \ldots, W_n)$, together with a sequence of numbers $(T_1, \ldots, T_n)$
with the property that if $N_i$ is the number of linear orders on the set $W_i$, then $N_i T_i$ is
a constant for all $i$.

To do this we will define set functions $V_1$ and $V_2$ on partial orders of $A$.
The range of $V_1$ will be partial orders of $A$ and the range of $V_2$ will be the natural
numbers.

Now; 

Given $A$ and $r \subseteq A \times A$ an irreflexive partial order:

$W_1 = r$, $T_1 = 1$

$W_2 = r \cup \langle a_1, \ldots, a_n \rangle$, where $\langle a_1, \ldots, a_n \rangle$ is any linear order of the
elements of $A - r'$; $T_2 = [o(A - r')]!$

$W_3 = W_2^*$, $T_3 = T_2$

for $n > 3$, $W_n = V_1(W_{n-1})$; $T_n = [V_2(W_{n-1})]\, T_{n-1}$
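
A brute-force count of linear orders makes this bookkeeping concrete. The Python sketch below is an illustration only, assuming that the invariant maintained by the procedure is (number of linear orders containing $W_i$) times $T_i$; `count_linear_extensions` is a hypothetical helper and is practical only for very small sets.

```python
from itertools import permutations


def count_linear_extensions(elements, r):
    """Count the orderings of `elements` in which a comes before b
    for every pair (a, b) of the partial order r."""
    count = 0
    for perm in permutations(elements):
        position = {x: i for i, x in enumerate(perm)}
        if all(position[a] < position[b] for (a, b) in r):
            count += 1
    return count
```

For $A = \{a, b, c, d\}$ and $r = \{(a,b)\}$, the elements outside $r'$ are $c$ and $d$, so $T_2 = 2! = 2$. Ordering them as $\langle c, d \rangle$ halves the number of consistent linear orders, so under the assumption above the product with $T_2$ is unchanged.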

Definition of $V_1$, $V_2$ (argument will be called $r$):

1) If there is an $r$-cluster $X \subseteq A$, then if $X = \{x_1, \ldots, x_n\}$

$V_1(r) = C[\, r|(A - X) \cup \{(a,x_1) \mid a \in P_r(X)\} \cup \{(x_n,b) \mid b \in F_r(X)\} \,] \cup \langle x_1, \ldots, x_n \rangle$

$V_2(r) = n!$

2) If there is no cluster contained in $r$ but $p_1, p_2 \in \mu(r)$ ($(p_1, p_2)$ is
ready), then if $x_1, x_2, \ldots, x_n$ are all the base chains of $r$ connecting $p_1$ and $p_2$

$V_1(r) = C[\, (p(r) - (\{(\sup x_i', p_2) \mid 1 \le i \le n-1\} \cup \{(p_1, \inf x_i') \mid 2 \le i \le n\})) \cup \{(\sup x_i', \inf x_{i+1}') \mid 1 \le i \le n-1\} \,]$

$V_2(r) = \dfrac{\left[\sum_{i=1}^{n} o(x_i)\right]!}{\prod_{i=1}^{n} [o(x_i)!]}$

3) If there is no cluster or ready pair, then

a) $r$ consists of one chain, in which case $V_1(r) = r$, $V_2(r) = 1$, or
b) There is a tangle $T \subseteq r$ (see Theorem 4 below).

In case (b) consider the set $Y$ of base chains of $r$ which are contained in
$T$ and which meet the following conditions:

1) $y \in Y \Rightarrow d(T) > d(T|(T' - y'))$

2) $y \in Y \Rightarrow$ there is a path from $x_y$ to $z_y$ containing no
element of $y'$.

Note: $\{x_y\} = P_r(y)$, $\{z_y\} = F_r(y)$

Now denote the paths from $x_y$ to $z_y$ by $\pi_y^i$, $1 \le i \le k_y$.

For each $y$ partition the set of paths into equivalence classes by number of merge
points on each path. Choose from the class with the fewest merge points any
path which has the maximal number of points of all paths in the class.

Call this path $\pi_y$.

Let $l(y)$ be the length of chain $y$,

$m(y)$ be the number of merge points on $\pi_y$,

$g(y) = \dfrac{[l(y) + m(y)]!}{[l(y)]!\,[m(y)]!}$

Choose $y^* \in Y$ such that $g(y^*)$ is minimal.

Now considering only the merge points in $\pi_{y^*}$ we have $g(y^*)$
order-preserving permutations of these with the points of $y^*$.

In other words there are $g(y^*)$ linear orders on the set $y^{*\prime} \cup [\mu(r) \cap \pi_{y^*}]$ which
are consistent with $r$ on this set.

Let $\mu(r) \cap \pi_{y^*}$ be $\{m_1, \ldots, m_n\}$, $n = m(y^*)$


y*  =  <Zy.. 
[  A  2  /  •  • '  / 


i  =  i{y*) 

A  /  be  the  linear  orders  mentioned  above. 


The  sets  are  partitions  of  y*  and  tt  .  representing  the  points  between 

y* 


successive pairs of merge points.
Now, if $L$ is any linear order on $T'$,
$V_1(r) = (r - T) \cup L$
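
The counting factors used above are interleaving counts: the number of ways to merge several chains into one linear order while keeping each chain's internal order is a multinomial coefficient, and for two chains of lengths $l$ and $m$ it reduces to $(l+m)!/(l!\,m!)$. The Python sketch below, offered as an illustration with hypothetical function names, compares the closed form with a brute-force enumeration.

```python
from itertools import permutations
from math import factorial


def interleavings(chain_lengths):
    """Multinomial count: (sum of lengths)! divided by each length factorial."""
    total = factorial(sum(chain_lengths))
    for k in chain_lengths:
        total //= factorial(k)
    return total


def brute_force_interleavings(chains):
    """Count orderings of all points that keep every chain in its own order."""
    points = [p for chain in chains for p in chain]
    count = 0
    for perm in permutations(points):
        position = {p: i for i, p in enumerate(perm)}
        if all(position[c[i]] < position[c[i + 1]]
               for c in chains for i in range(len(c) - 1)):
            count += 1
    return count
```

With chain lengths 2, 2 and 1 both functions give 30, and with a chain of length 2 and a path carrying 1 merge point both give 3.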


Theorem  4:  If  r  is  a  partial  order  on  A  such  that  r'  =  A  and  if  r  is  not 

a  linear  order,  then  r*  must  have  at  least  one  cluster,  tangle  or  ready 
pair  of  merge  points . 

Proof: We proceed by induction on the number of base chains of $r^*$.
a) There is no base chain of $r^*$.

Thus, since $r' = A$, all members of $A$ are merge points of $r^*$. We will show
there is a cluster or tangle in $r^*$. We may assume $r^*$ is a lattice or we
would be done by Theorem 1. Consider $F_{r^*}(\lambda)$. Each element of this set
has at least two successors because they are all merge points and they have
only one predecessor ($\lambda$). We know $a, b \in F_{r^*}(\lambda) \Rightarrow F_{r^*}(b) \neq F_{r^*}(a)$ because $r^*$ is
a lattice. Now assume there exist $a \neq b \in F_{r^*}(\lambda)$ such that $F_{r^*}(a) \cap F_{r^*}(b) \neq \emptyset$. Since $r^*$ is a lattice
this means there is a $c$ such that $F_{r^*}(a) \cap F_{r^*}(b) = \{c\}$. Let $b_1 = \lambda$, $b_2 = a$, $b_3 = b$, $b_4 = c$, $b_5 \in F_{r^*}(a) - \{c\}$.


and  b,  hcwcvor,  wn.-j  ih.ti  (a^ ;  -  p’  ^ind  P^*(^)  fl  T  0-  We 

will now show that there are always an $a$ and $b$ meeting these conditions.
Let $M$ = maximum length of all paths from $\lambda$ to $\rho$; we will proceed by
induction on $M$.

If $M \le 2$ then we cannot have a lattice with all merge points, so let $M = 3$.


[Figure: graph of a lattice with least element $\lambda$ and greatest element $\rho$]

The above graph is a lattice of all merge points with $M = 3$.

Now consider $F_{r^*}(\lambda)$. This has at least two elements because otherwise
there would exist an $a$ with only one preceder and one follower, hence violating
our conditions. Now consider $F_{r^*}(a)$, $a \in F_{r^*}(\lambda)$. $\rho \notin F_{r^*}(a)$ because
$a$ would then not be a merge point. Also, $o(F_{r^*}(a)) > 1$ for the same reason.
Now $F_{r^*}(F_{r^*}(a)) = \{\rho\}$, so $F_{r^*}(a)$ is a cluster unless there exist $b \in F_{r^*}(\lambda)$,
$c \in F_{r^*}(a)$, $b \neq a$, such that $c \in F_{r^*}(b)$. But this would mean that $F_{r^*}(b) \cap F_{r^*}(a) \neq \emptyset$
and we would have a tangle.

Therefore, assume our hypothesis is true for $3 \le M \le k$. We wish to prove it
for $M = k + 1$.

Let us therefore suppose that we have $r^*$ such that $M = k + 1$, $\mu(r^*) = A$. Consider
$F_{r^*}(\lambda)$. We may assume that $o(F_{r^*}(\lambda)) > 1$, for otherwise $r^* | [A^* - F_{r^*}(\lambda)]$
would have a tangle, cluster, or ready pair by our induction hypothesis, hence
so would $r^* | (A^* - \{\lambda\})$, and hence so would $r^*$. We may further assume that
$a, b \in F_{r^*}(\lambda) \Rightarrow F_{r^*}(a) \cap F_{r^*}(b) = \emptyset$, for if not we know we would have a
tangle and we would be done.

Let $B = A^* - \{a \in A \mid$ there is a $b \in F_{r^*}(\lambda)$ such that $a \in F_{r^*}(b)\}$. If we now consider $r^*|B$
we have a partial order such that $M = k$. Thus if $r^*|B$ contains no non-merge
points it will contain a cluster or tangle. But the only way for this to not
happen is if $a, b \in A^* - B \Rightarrow F_{r^*}(a) \cap F_{r^*}(b) = \emptyset$. But then we would have a


unless this set is empty. But if this set is empty then $a$ cannot be a merge
point. Hence we may assume $r^*|B$ has a tangle or a cluster.

If $r^*|B$ contains a tangle then this is also an $r^*$ tangle unless the "not
comparable" condition on $b_2$, $b_3$ and $b_4$, $b_5$ fails, or a new sup for $b_2$
and $b_3$ is less than a new inf for $b_4$ and $b_5$. But the first case cannot
happen by the definition of a restriction of a partial order. For the second
case to happen we would need $\sup_{r^*} \{b_2, b_3\} = \inf_{r^*} \{b_4, b_5\} \in A^* - B$.

But this cannot happen since we are assuming that $F_{r^*}(b_2)$ and $F_{r^*}(b_3)$ are
disjoint if $b_2, b_3 \in F_{r^*}(\lambda)$, which would be a necessary condition for
$\sup \{b_2, b_3\} \in A^* - B$.

Hence if $r^*|B$ contains a tangle so does $r^*$.

Now  r*|  B  cannot  contain  a  cluster  unless  it  contains  a  tangle  because  it  is 
a  lattice  if  it  has  no  tangle,  and  every  point  is  a  merge  point.  In  a  lattice 
every  cluster  has  a  unique  follower  and  a  unique  predecessor  and  hence  its 
elements  cannot  be  merge  points. 

Hence if $r^*$ contains only merge points it contains a cluster or a tangle.


G 


Now assume that if $k$ is the number of base chains of a lattice, $k \le p$ means
that the lattice has a cluster, tangle, or ready pair. Let $r^*$ have $p+1$ base chains.
Consider any right $r^*$-merge point $m$ and the segment $M = r^* | \{y \mid y \ge m\}$.

If $M$ has $p$ or fewer base chains then it has a tangle, cluster, or ready pair, and by
Theorem 3 so does $r^*$. If $M$ has at least $p+1$ base chains consider the set of right
merge points of $M$ greater than $m$. Assuming this set is non-empty let $m_1$ be a minimal
element of it and let $M_1 = r^* | \{y \mid y \ge m_1\}$. $M_1 \subset M$ since $m_1 > m$, so $y \ge m_1 \Rightarrow$
$y > m$. If $M_1$ has $p$ or fewer base chains then apply Theorem 3 to show $r^*$ has a
cluster, tangle, or ready pair, so assume $M_1$ has at least $p+1$ base chains.

Proceeding as above we obtain a sequence of non-trivial segments of $r^*$, each one
properly contained in the preceding one. Since $r^*$ is finite this sequence must
terminate, but it can do so only if

1) for some $i$, $M_i$ has $p$ or fewer base chains, or

2) for some $i$, $M_i$ contains no right merge points other
than $m_i$.

In case 1 an application of Theorem 3 shows that $r^*$ has a cluster, tangle, or ready
pair, so let us consider case 2. Let $L$ be the set of left merge points of $M_i$, and let
$x$ be a minimal element of $L$. $x \neq m_i$ since $m_i$ is not a left merge point of $M_i$, so
$x > m_i$. Also, there exist $a, b \in M_i$ such that $m_i < a < x$ and $m_i < b < x$ and $a$, $b$ are not comparable
in $M_i$. This is true because $x$ is a left merge point of $M_i$ and by the minimality of $x$. Thus
there is a path from $m_i$ to $x$ containing $a$ and no merge points, and a path containing
$b$ and no merge points (by the minimality of $x$). Since $a$ and $b$ are not comparable
they cannot be on the same base chain, so by Lemma 4 there are two base chains
connecting $m_i$ and $x$, so $(m_i, x)$ is ready, and by Theorem 3 $r^*$ contains a cluster,
tangle, or ready pair. Our induction is now complete. Hence, the completion of any
non-trivial partial order contains a cluster, tangle, or ready pair of merge points.

Q.E.D.




APPENDIX C: Determining the Boundaries of Optimization Effectiveness

Our approach is to assume that a balanced program exists, move the
CU interchangeable instructions to a single PE, and compare execution
time. We first consider a block of code in which all the instructions are
interchangeable. For present purposes we further assume that each
interchangeable instruction takes approximately the same amount of time to
execute in its respective processor. Call average instruction times
$T_{cu}$ and $T_{pe}$. Let $N_{pe}$ = the number of PE instructions and $N_{cu}$ = the
number of CU instructions. Assume a balanced segment of code executed
in time $T$. Since execution is balanced: $T = (N_{pe})(T_{pe}) = (N_{cu})(T_{cu}) + N_{pe}$;
that is, the product of the number of instructions and average instruction
time in the PE's is equal to that same product in the CU plus an additional
tick to decode each PE instruction.

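Hence:

$$N_{cu} = \frac{N_{pe}\,(T_{pe} - 1)}{T_{cu}}$$

If all the instructions were executed in the PE's, then the total execution
time would be:

$$T = \left[ N_{pe} + \frac{N_{pe}\,(T_{pe} - 1)}{T_{cu}} \right] T_{pe}$$

The ratio of balanced to single processor execution is:

$$\frac{N_{pe}\,T_{pe}}{N_{pe}\,T_{pe} \left[ 1 + \frac{T_{pe} - 1}{T_{cu}} \right]} = \frac{T_{cu}}{T_{cu} + T_{pe} - 1}$$

The savings factor is:

$$1 - \frac{T_{cu}}{T_{cu} + T_{pe} - 1} = \frac{T_{pe} - 1}{T_{cu} + T_{pe} - 1}$$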

An examination of the second case, namely that a segment contains
PE exclusive instructions (i.e. SIM assignments) and a sufficient number
of interchangeable instructions to balance them, yields the same result.

This is due to the fact that, regardless of what is being executed in the
PE's, the interchangeable instructions removed from the CU and placed
in a single PE will increase execution by the same margin. Hence
allocation and relocation are equally effective techniques.

Estimates of execution savings are based on a number of simplifying
assumptions.

With regard to average execution time, some instructions are
combined. For example, a PE 'load' is interpreted as a load to the R
register and a route. CU average execution is determined for two cases:
optimized and unoptimized. For optimized execution it is assumed that a single
load from PE memory is required for eight references to local memory.
In the unoptimized case, it is assumed that each reference to local
memory requires an additional load from PE memory. In either case,
a 'load' is the combined instructions.


Interchangeable Operation     PE     CU Optimized     CU Unoptimized

LOAD                           3          4                11
OPERATION, Etc.                3          4                11
STORE                          3          4                11

Estimated Savings              -         33%               15%
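
The estimated savings in the table follow from the savings factor derived earlier in this appendix. The short Python sketch below is an illustration, not part of the original report; the function names are hypothetical. It evaluates the factor for the tabulated instruction times and cross-checks it against the balanced and single-processor execution times computed from first principles.

```python
def balanced_cu_count(n_pe, t_pe, t_cu):
    """Number of CU instructions that balances n_pe PE instructions:
    N_cu = N_pe * (T_pe - 1) / T_cu."""
    return n_pe * (t_pe - 1) / t_cu


def savings_factor(t_pe, t_cu):
    """Fraction of single-processor time saved by balanced execution:
    (T_pe - 1) / (T_cu + T_pe - 1)."""
    return (t_pe - 1) / (t_cu + t_pe - 1)


def savings_from_times(n_pe, t_pe, t_cu):
    """Same quantity obtained directly from the two execution times."""
    balanced = n_pe * t_pe
    single = (n_pe + balanced_cu_count(n_pe, t_pe, t_cu)) * t_pe
    return 1 - balanced / single
```

With a PE time of 3 and CU times of 4 (optimized) and 11 (unoptimized), the factor evaluates to 2/6 and 2/13, which round to the 33% and 15% shown in the table.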

[6] 